Lucene Utilities and Bloom Filters

As you may remember, at Greplin we have built some of our search features on top of the excellent Lucene project. As avid users, we've built a fair number of tools that help us use Lucene to the fullest. Today we're happy to announce that we'll be open sourcing a few more of them in the greplin-lucene-utils GitHub project.

Some noteworthy features include:

  • A class that construct BooleanQueries in an Effective Java-style Builder pattern 
  • A query that matches no documents (sounds odd, but it's come in handy more than once for us)
  • A query that matches phrase prefixes - for example "Epic w" will match both "Epic win" and "Epic wonder". This is particularly useful for implementing Google Instant style searches
  • A Collector that collects all matching documents in no specified order.
  • A Collector that only collects the first matching document.
  • A Collector that counts all matching documents.

 

We're also excited to introduce an open source persistent Bloom Filter implementation - the greplin-bloom-filter project. Storing 50,000 2.5KB items in a traditional hash set requires over 125MB. But if you're willing to accept a 1-in-10,000 false positive rate on lookups, a comparable counting bloom filter requires under 500KB (over 250x space savings) - see our benchmark for details. Some noteworthy implementation details:

  • Instead of using N-distnct hashes, we use a linear combination of two hashes (as pioneered by Kirsch and Mitzenmacher)
  • Our base hash is a Java implementation of Appleby's MurmurHash (maybe an ambitious contributor can try to build a Java implementation of Google's CityHash though!)
  • We use 4-bit counting buckets, which means items can be deleted from the BloomFilter
  • Our bloom filters can be efficiently written to disk to support persistence

 

As always, pull-requests, issue reports, and comments are welcome!