Google App Engine – Full Text Search with JDO – Revisited

Objective

This article shows how to implement full-text search on Google App Engine using JDO. I tried my hand at this a couple of months ago, but after watching this presentation I decided to do it properly.

The Problem

In my first attempt I managed to get the search working, but after watching Brett Slatkin's presentation I realized where the problem was. In short, deserializing a list of strings (which is what our search index is) is a very costly operation, and he presented a way around it. Below you will find my implementation of that solution.

Data Model

For this example we will use the following data model: a Customer (name, contact name, comments) has a list of Addresses and a list of Phones. We need to be able to find a customer by name, address, or phone.

    import java.util.ArrayList;
    import java.util.List;
    import javax.jdo.annotations.Element;
    import javax.jdo.annotations.IdGeneratorStrategy;
    import javax.jdo.annotations.IdentityType;
    import javax.jdo.annotations.PersistenceCapable;
    import javax.jdo.annotations.Persistent;
    import javax.jdo.annotations.PrimaryKey;
    import com.google.appengine.api.datastore.Key;

    @PersistenceCapable(identityType = IdentityType.APPLICATION, detachable="true")
    public class Customer {
        @PrimaryKey
        @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
        private Long id;
        @Persistent
        private String name;
        @Persistent
        private String contactName;
        @Persistent
        private String comments;
        @Persistent(mappedBy = "customer")
        @Element(dependent = "true")
        private List<Address> addresses = new ArrayList<Address>();
        @Persistent(mappedBy = "customer")
        @Element(dependent = "true")
        private List<Phone> phones = new ArrayList<Phone>();
        // Child entity that holds the search index (shown further down)
        @Persistent(dependent="true")
        private CustomerIndex index;
        // getters and setters go here....
    }

    @PersistenceCapable(identityType = IdentityType.APPLICATION, detachable="true")
    public class Address {
        @PrimaryKey
        @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
        private Key id;
        @Persistent
        private String type;
        @Persistent
        private String line1;
        @Persistent
        private String line2;
        @Persistent
        private String city;
        @Persistent
        private String state;
        @Persistent
        private String zip;
        @Persistent
        private Customer customer;
        // getters and setters go here....
    }

    @PersistenceCapable(identityType = IdentityType.APPLICATION, detachable="true")
    public class Phone {
        @PrimaryKey
        @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
        private Key id;
        @Persistent
        private String type;
        @Persistent
        private String phone;
        @Persistent
        private Customer customer;
        // getters and setters go here....
    }

If you paid attention, you noticed that the Customer class has an interesting child called CustomerIndex. Here it is:

    @PersistenceCapable(identityType = IdentityType.APPLICATION, detachable="true")
    public class CustomerIndex {
        @PrimaryKey
        @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
        private Key id;
        // The set of searchable words for the parent Customer
        @Persistent
        private Set<String> index;
        // getters and setters go here....
    }

Search Approach

Here is the theory behind what we are going to do. Since deserializing List properties is a very costly operation (and we do not care about the data the index holds anyway), we move the customer search index property into a child object. We run the search against this child and fetch only the keys of the matching child objects. This way we do not incur the penalty of deserializing our search index (the search happens on the index). Once we have the child object keys, we load the parent objects that those keys point to. We can do this because a child key is a composite key that always includes the parent key.
To make our search more usable we will use Lucene and its SnowballAnalyzer for word stemming.
Here is the method that gives us the Set of words. We use it to generate both the index of searchable words and the search terms.

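    import java.io.StringReader;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    protected Set<String> getIndex( String input, int maxTokens ) {
        Set<String> returnSet = new HashSet<String>();
        try {
            // English Snowball stemmer; stopWords() supplies the words to skip
            Analyzer analyzer = new SnowballAnalyzer( org.apache.lucene.util.Version.LUCENE_30, "English", stopWords() );
            TokenStream tokenStream = analyzer.tokenStream( "content", new StringReader(input) );
            // Collect stemmed terms until the input runs out or maxTokens is reached
            while ( tokenStream.incrementToken() && (returnSet.size() < maxTokens) ) {
                if ( tokenStream.hasAttribute( TermAttribute.class ) ) {
                    TermAttribute attr = tokenStream.getAttribute( TermAttribute.class );
                    logger.debug( attr.term() );
                    returnSet.add( attr.term() );
                }
            }
        } catch ( Exception exc ) {
            // The original called logger.equals(exc), which silently discarded the error
            logger.error(exc);
        }
        return returnSet;
    }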
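To make the contract of getIndex() easier to see without pulling in Lucene, here is a minimal stand-in that uses only the JDK. It is a sketch under assumptions, not the method above: it lowercases and splits the input but does no stemming, and the class name SimpleIndexer and its stop-word list are made up for illustration (the article's stopWords() helper is not shown).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Simplified, hypothetical stand-in for getIndex(): no Lucene, no stemming.
class SimpleIndexer {
    // Assumed stop words, for illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "and", "of", "the"));

    // Returns at most maxTokens distinct lowercase words from the input.
    static Set<String> tokens(String input, int maxTokens) {
        Set<String> result = new LinkedHashSet<String>();
        if (input == null) {
            return result;
        }
        // Split on every run of characters that are not letters or digits.
        for (String word : input.toLowerCase(Locale.ENGLISH).split("[^\\p{L}\\p{N}]+")) {
            if (result.size() >= maxTokens) {
                break;
            }
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [acme, trading, company]
        System.out.println(tokens("The Acme Trading Company of Boston", 3));
    }
}
```

The same function has to be used for both the stored index and the search input, otherwise the stored words and the search words will not match.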

Here is our search method:

    public List<Customer> searchCustomers( String search1, Long entityId ) throws IOException {
        PersistenceManager pm = PMF.getManager();
        // Stem the search phrase the same way the index was built (at most 3 terms)
        Set<String> search = getIndex(search1, 3);

        // Select only the ids of the matching CustomerIndex children,
        // so the index set itself is never deserialized
        Query query = pm.newQuery("SELECT id FROM " + CustomerIndex.class.getName() );
        query.setFilter("index == param0");
        query.declareParameters("String param0");

        // Load the parent Customer objects by key
        Query query2 = pm.newQuery(Customer.class);
        query2.setFilter("id == keyParam");
        query2.declareParameters("com.google.appengine.api.datastore.Key keyParam");

        List<Customer> custs = null;
        List<Key> keys;
        List<Key> parents = new ArrayList<Key>();
        try {
            keys = (List<Key>) query.execute( search );
            // A child key embeds its parent key, so no extra lookup is needed
            for ( Key k : keys ) {
                parents.add( k.getParent() );
            }
            custs = (List<Customer>) query2.execute( parents );
            // Touch the child collections so they are loaded before pm is closed
            for ( Customer cust : custs ) {
                for ( Address addr : cust.getAddresses() )
                    logger.debug( addr.getId() );
                for ( Phone ph : cust.getPhones() )
                    logger.debug( ph.getId() );
            }
        } catch ( Exception exc ) {
            logger.error(exc);
        } finally {
            query.closeAll();
            query2.closeAll();
            pm.close();
        }
        return custs;
    }

You will notice that we walk the Address and Phone lists for each customer to load them from storage before the PersistenceManager is closed. We do that so we can ship them over the wire: the UI in this case is a Flex client, so we serialize the results to JSON.
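The two-step lookup in searchCustomers() can be modeled in plain Java to show why it avoids touching the index data. The sketch below is an in-memory illustration only, not the App Engine API: the class TwoStepSearch, its IndexKey type, and the maps standing in for the datastore are all hypothetical. What it keeps from the real thing is the key idea that a child key carries its parent's id.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// In-memory model of the keys-only search; every name here is hypothetical.
class TwoStepSearch {
    // Stand-in for a child key: it embeds the parent's id.
    static final class IndexKey {
        final long parentId;
        final long id;
        IndexKey(long parentId, long id) {
            this.parentId = parentId;
            this.id = id;
        }
    }

    // word -> keys of the index children containing it (plays the datastore index)
    private final Map<String, List<IndexKey>> keysByWord = new HashMap<String, List<IndexKey>>();
    // parent id -> customer name (plays the Customer entities)
    private final Map<Long, String> customers = new HashMap<Long, String>();
    private long nextIndexId = 1;

    void store(long customerId, String name, Set<String> words) {
        customers.put(customerId, name);
        IndexKey key = new IndexKey(customerId, nextIndexId++);
        for (String word : words) {
            List<IndexKey> keys = keysByWord.get(word);
            if (keys == null) {
                keys = new ArrayList<IndexKey>();
                keysByWord.put(word, keys);
            }
            keys.add(key);
        }
    }

    List<String> search(String word) {
        // Step 1: fetch only the keys of the matching index children.
        List<IndexKey> hits = keysByWord.get(word);
        Set<Long> parentIds = new LinkedHashSet<Long>();
        if (hits != null) {
            for (IndexKey hit : hits) {
                parentIds.add(hit.parentId); // derived from the key, no child load
            }
        }
        // Step 2: load the parents by id.
        List<String> result = new ArrayList<String>();
        for (Long parentId : parentIds) {
            result.add(customers.get(parentId));
        }
        return result;
    }

    public static void main(String[] args) {
        TwoStepSearch db = new TwoStepSearch();
        Set<String> acmeWords = new LinkedHashSet<String>();
        acmeWords.add("acme");
        acmeWords.add("boston");
        db.store(1L, "Acme", acmeWords);
        System.out.println(db.search("boston")); // prints [Acme]
    }
}
```

Note that the word sets are only read when they are written to the index map. The search path reads keys and parents, never the sets themselves, which is the saving the article is after.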

Conclusion

Text search can be implemented on GAE, and it can be implemented efficiently to boot. Just remember that before you store a Customer record you need to populate its CustomerIndex object with the set of words to search on. I simply concatenate all the properties into one string and let Lucene build the set by calling my getIndex().
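The concatenation step is not shown in the article, so here is one way it might look. This is a sketch under assumptions: the helper class IndexText and its concat() method are hypothetical, and the caller is assumed to collect the customer, address, and phone properties into a list first. It skips null or blank properties so they do not end up in the index text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that builds the text later handed to getIndex().
class IndexText {
    // Joins the non-blank parts with single spaces, skipping nulls.
    static String concat(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            if (part == null || part.trim().isEmpty()) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(part.trim());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> parts = new ArrayList<String>();
        parts.add("Acme Trading"); // customer name
        parts.add(null);           // missing contact name
        parts.add("12 Main St");   // address line
        parts.add("555-1234");     // phone
        // Prints: Acme Trading 12 Main St 555-1234
        System.out.println(concat(parts));
    }
}
```

The resulting string would then go through getIndex() with a token limit large enough for a whole record, and the returned set would be stored on the CustomerIndex before the Customer is persisted.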


Posted in Google App Engine.



2 Responses


  1. Gene says

    I like this example. It is obviously very comprehensive and a big help for myself. So thanks for that.

    I have downloaded the Lucene 3.0.2 package, but I am having trouble downloading the SnowballAnalyzer code. Could you please supply a location for the jar file?

  2. Gene says

    Ignore me.. it is in the lucene 3.0.2 package under: /lucene-3.0.2/contrib/snowball

    Thanks!


