Greplin Zookeeper Utils

We at Greplin are huge fans of Apache's excellent Zookeeper proejct, which allows you to setup a cluster ('ensemble' in Zookeeper talk) of servers that can present a very fault-tolerant filesystem like abstraction to clients.

Zookeeper isn't meant to store significant amounts of data (in fact, the entire 'file system' must fit in memory and all writes are funneled through a single dynamically elected 'master') - but it's perfect for building distributed coordination, notification, locking, and configuration systems. The details are beyond the scope of this announcement, but I'd highly recommend Konar and Reed's introduction.

However, the filesystem abstraction provided by Zookeeper is a bit low level for many common tasks (like implementing leader elections or distributed transactions/locking). To that end, Apache has created a list of recipes (similar to 'patterns') for Zookeeper that explain common problems and document solutions.

At Greplin we have implemented some of these solutions (notably locking and leader elections), and are happy to open source them in our greplin-zookeeper-utils. The package also adds a simple way to programatically and quickly start/stop single node zookeeper ensembles - which is sometimes useful for testing.

This package is still new and growing - but we hope it's useful to others too!

A quick demonstration:

Code independence day - Nagios utilities, OOM diagnostics, and "hop"

For July 4th this year, we celebrated as many people in America do: by overeating, enjoying fireworks, and watching an amazing movie.

We decided it would also be fun to grant independence to some of our code.  Today we're open sourcing:

  • polarbear - a Java tool that helps diagnose OutOfMemoryError conditions.
  • greplin-nagios-utils - Python-based DSL + utilities for writing and running Nagios checks
  • hop - a ridiculously easy way to navigate around your favorite directories

We hope you find these useful - as always we look forward to comments, suggestions, and pull requests.

AWS Benchmarks and Best Practices

Here at Greplin, we've built a lot of our infrastructure on Amazon's Elastic Compute Cloud (EC2) and Elastic Block Store (EBS). After the serious outage last week, some other prominent AWS users are moving off EBSs (or off Amazon's cloud entirely)! There are, of course, risks to building your site on someone else's infrastructure - but using the EC2 and EBS has provided enough benefits for us that it outweighs the downside.

 

Benchmarks

To start with some hard numbers here are the results of some benchmarks we've run on different types of storage:

 

Ebs-bench_2

Some key take aways (not all of which are represented in the graph):

  • RAIDed EBSs are about the fastest disk you can get on EC2 (especially for random I/O).
  • RAID0 has considerably better random IO performance than RAID10
  • We've rerun our benchmark on different machines and different times, and noticed huge variations in the performance of single EBS drives (from being several times faster than a RAID0, to being several times slower than the numbers shown). RAIDs and Instance storage are more consistent.
  • LVM has basically no performance impact
  • There is only small performance improvement for RAID0s larger than 4 disks, and RAID10s larger than 8 disks
  • Mounting noatime makes almost no difference in these artificial benchmarks or our application specific benchmark

 

Methodology

All tests were run on c1.medium instances in US-EAST region. The RAID0 and RAID10 arrays both consisted of 16-disks each.

The RAID0 was created with a series of commands like:

 

While the RAID10 was built with the following commands:

We tested a wide variety of chunk and read ahead sizes, and these numbers seemed to work best for our application.

To generate the benchmark numbers above, we used the following IOZone command:

./iozone -Rb ~/output.xls -s 6g -i 0 -i 1 -i 2 -f /testing/device/mountpoint -r 32k

We also used an application-specific benchmark tool, which generated results substantially proportional to the random I/O numbers above.

 

Best Practices

  • Use LVM, RAID, and XFS (RAID helps smooth out flaky performance, and XFS + LVM is fast, and allows online partion growth)
  • Use proper chunk sizes, read ahead, and sunit/swidth (256K chunks and 64K readahead works well for us)
  • Make sure your RAID came up on system boot. One of our worst bugs was caused by not noticing a RAID disappeared after a reboot.
  • Learn about /proc/diskstats, iostat, iotop, iozone, and bonnie++
  • Ideally turn off swap. But, if you must swap, lower the kernel's swappiness setting and put the swap file on an instance store instead of an EBS. Swapping even lightly to an EBS can effectively kill an instance.
  • You need to Stop and Start an instance to get it assigned to different hardware
  • Use the firewall and/or VPC from the beginning. It's always more painful to setup later. 
  • Script, monitor, and document everything. Things will fail unexpectedly. Monitoring will help you catch it, and scripts and documentation will help you fix it.

 

 

Benefits

Now that we've gotten some of the hard numbers out of the way, let's discuss some of the broader issues with EC2. The major benefit of AWS is that, in combination with the right architecture, we can trivially provision extra capacity with a one line Fabric command. Last month, for example, we got unexpected coverage on a popular Chinese blog at 3AM PST, and had to provision new servers to handle the increased load. Our lives were made much easier since we could just type one line into a terminal and go back to sleep!

EBS drives are touted for their outstanding durability across instance life cycles - which our experience with thousands of drives has cooroborated (I can count on one hand the number of drives that have lost data on us). However, being able to add drives to a machine with a single command adds a lot of flexibility. Sometimes machines run low on disk space - usually caused by software improvements that allow a single machine to handle more data or uneven sharding. The traditional solutions to this problem are to either over-provision disk capacity up front (which can waste quite a bit of money, and still requires fairly accurate disk usage estimates), to shutdown the machine to add more disk capacity, or to move

 

some existing services between machines to rebalance things (which is non-trivial in the general case). But, since we use EBS drives, we can trivially expand partions online. A simple script will attach additional drives to the machine, automatically build a RAID out of them, add it to the instance's logical volume, and expand the XFS partition that's running low on space - all with absolutely no downtime. 

Because of the features and automatability that Amazon allows, we've managed to grow to hundreds servers without a full time sysadmin or dev-ops person (although we're looking to hire the right one :-D). This would have been impossible for a team our size if we had to deal with hardware failures, racking servers, configuring switches, etc. Instead, we've built powerful tools that enable us to automate almost all our infrastructure.

 

Caveats

Of course, all the magic of AWS doesn't come without a cost. The biggest downside we've encountered is the flakiness of EBS performance. On a good day, we might see 5ms disk seek times - but on a bad day we've seen worse than 200ms. The biggest advantage of a large RAID is that it smooths out EBS p

 

erformance characteristics. We've sometimes observed single EBS disks outperforming a 16 drive RAID0 - but we've also seen single disks slow to the point of being useless. Big RAIDs will be consistently and predictably mediocre.

Second, almost any EBS related commands can fail for no apparent reason. One of the worst offenders is that ~2% of the time time you try to attach an EBS disk to an instance, it will just hang (and say 'attaching ...' for hours). The only workaround is to choose a different device name for the drive - no drive will work under that device name for a few hours.

A less annoying issue, that you're probably already aware of, is that you will lose individual instances. Your architecture must be able to survive the loss of any one machine without bringing the site down (we haven't been brave enough to try the Chaos Monkey approach - but we've lived through several trials by fire). The unfortunate thing is that the failure behaviour isn't particularly consistent. The AWS console may not show the instance as being dead until an hour after it has actually died. Or there may be a half hour wait to shutdown the broken machine or detach its EBS (a 'shutdown'/'start' cycle usually gets you new underlying hardware. A simple reboot usually will not).

Finally, we've (rarely) run into capacity issues in a particular availability zone. Either have enough capacity around that you'll survive if you can't provision a particular instance type for a day or so, or be flexible in which availability zone you're willing to bring new instances up in.

 

 

 

We're still pretty new to AWS, so please share any tips, tricks, or bugs you've discovered!

Lucene Utilities and Bloom Filters

As you may remember, at Greplin we have built some of our search features on top of the excellent Lucene project. As avid users, we've built a fair number of tools that help us use Lucene to the fullest. Today we're happy to announce that we'll be open sourcing a few more of them in the greplin-lucene-utils GitHub project.

Some noteworthy features include:

  • A class that construct BooleanQueries in an Effective Java-style Builder pattern 
  • A query that matches no documents (sounds odd, but it's come in handy more than once for us)
  • A query that matches phrase prefixes - for example "Epic w" will match both "Epic win" and "Epic wonder". This is particularly useful for implementing Google Instant style searches
  • A Collector that collects all matching documents in no specified order.
  • A Collector that only collects the first matching document.
  • A Collector that counts all matching documents.

 

We're also excited to introduce an open source persistent Bloom Filter implementation - the greplin-bloom-filter project. Storing 50,000 2.5KB items in a traditional hash set requires over 125MB. But if you're willing to accept a 1-in-10,000 false positive rate on lookups, a comparable counting bloom filter requires under 500KB (over 250x space savings) - see our benchmark for details. Some noteworthy implementation details:

  • Instead of using N-distnct hashes, we use a linear combination of two hashes (as pioneered by Kirsch and Mitzenmacher)
  • Our base hash is a Java implementation of Appleby's MurmurHash (maybe an ambitious contributor can try to build a Java implementation of Google's CityHash though!)
  • We use 4-bit counting buckets, which means items can be deleted from the BloomFilter
  • Our bloom filters can be efficiently written to disk to support persistence

 

As always, pull-requests, issue reports, and comments are welcome!

Utilities for Tornado and Twisted

At Greplin, we love us some asynchronous Python.  Our web site is built using Tornado and our API clients are built using Twisted.

Being heavy users of these great frameworks, we found ourselves creating some general use utilities that we're excited to open source today:

As always: we love pull requests, issue reports, and comments!

Fast Python protocol buffers

We send a lot of data between different services - so much data that we actually found serialize and deserialize time with JSON to be a bottleneck.  We decided to use compact, fast protocol buffers instead - only to learn that in Python the default protocol buffers implementation is actually up to 10 times slower than just using JSON!

With that in mind, we're really proud to open source fast Python protocol buffers, a thin wrapper over C++ protocol buffers that is 10-15x faster than the default Python implementation and still has a nice Pythonic API.  Enjoy!

Results of the Greplin Programming Challenge

Two weeks ago we turned some of our favorite interview questions in to an online programming challenge.  We tried to make it more fun by adding levels, a "Where in the World is Carmen Sandiego?" style typewriter effect, and a call to an automated phone number (more on this later).

We thought people would like it, but we really underestimated how much!  The challenge actually has more Hacker News upvotes than our private beta launch!  While completing the challenge is neither necessary nor sufficient to get hired at Greplin, this has been by far our most active recruiting channel.

Some things we did right:

  • We launched it on a Friday.  At least a fifth of our email responses mentioned how nice it was to have something fun to do at work on a Friday.
  • People really appreciated the typewriter effect and general spy feel.
  • We made it doable - the challenge was approachable by people with many levels of experience.

Some things we could have done better:

  • Turns out it's not a good idea to make people call a random phone number.  We probably should have mentioned it was an automated line and that there wasn't going to be one of us on the other end.  Even then, it's a bad idea internationally. We removed the phone call requirement about 20 minutes after launching the challenge.
  • We should have had a harder "bonus" level.  Fortunately, someone did this anyways.

Overall, we were thrilled with the response and hope to release another challenge soon.

 

Solutions

The languages and methods varied greatly.  Many people mentioned brute force was enough for these problems, some people wrote the great algorithm anyway.  Some people just looked the answers up.

 

Longest Palindrome

A reasonably simple O(n^2) solution for palindrome finding iterates over 0 and 1 character palindrome centers and expands each out as far as it can go:

 

Some people added to the challenge by picking nontraditional languages.  One great example was the T-SQL solution from Tim Lehner:

 

There is also an O(n) solution.

 

Fibonacci Primes

Rodrigo Farnham sent a nice, very readable solution in Python:

 

Subset Sums

David Koblas sent us a one line (+ 1 import line + 1 input line) solution:

 

These are just a few of the hundreds of great solutions we received.  Feel free to post your own solution in the comments.  Also, the first person to solve all three problems in any of these languages gets a free lifetime Premium membership.