Here at Greplin, we've built a lot of our infrastructure on Amazon's Elastic Compute Cloud (EC2) and Elastic Block Store (EBS). After last week's serious outage, some other prominent AWS users are moving off EBS (or off Amazon's cloud entirely)! There are, of course, risks to building your site on someone else's infrastructure - but EC2 and EBS have provided enough benefits for us to outweigh the downsides.
Benchmarks
To start with some hard numbers, here are the results of benchmarks we've run on different types of storage:
Some key takeaways (not all of which are represented in the graph):
- RAIDed EBSs are about the fastest disk you can get on EC2 (especially for random I/O).
- RAID0 has considerably better random I/O performance than RAID10
- We've rerun our benchmark on different machines and different times, and noticed huge variations in the performance of single EBS drives (from being several times faster than a RAID0, to being several times slower than the numbers shown). RAIDs and Instance storage are more consistent.
- LVM has basically no performance impact
- There is only a small performance improvement for RAID0s larger than 4 disks, and RAID10s larger than 8 disks
- Mounting with noatime makes almost no difference in these artificial benchmarks or in our application-specific benchmark
Methodology
All tests were run on c1.medium instances in the US-EAST region. The RAID0 and RAID10 arrays each consisted of 16 disks.
The RAID0 was created with a series of commands like:
While the RAID10 was built with the following commands:
We tested a wide variety of chunk and read ahead sizes, and these numbers seemed to work best for our application.
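For readers who want a concrete starting point, here is a minimal sketch of how arrays like these can be assembled with mdadm. It is an illustration under stated assumptions, not the exact command sequence from our machines: the device names (/dev/sdf and friends), the md device, and the helper function names are hypothetical, and the 256K chunk and 64K read-ahead are simply the values listed under Best Practices. It also formats XFS directly on the array for brevity, whereas a setup with LVM would create the filesystem on the logical volume instead. By default it only prints the commands; set DRY_RUN=0 to actually run them (as root).

```sh
#!/bin/sh
# Sketch only: prints (or, with DRY_RUN=0, runs) the commands to build a
# striped md array. Device names and helper names are hypothetical.

DRY_RUN="${DRY_RUN:-1}"

# Echo a command; also execute it when DRY_RUN=0.
run() {
    echo "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# stripe_geometry LEVEL NDISKS CHUNK_KB
# Prints "sunit swidth" in 512-byte sectors, the units mkfs.xfs expects.
# RAID0 stripes data over every disk, RAID10 over half of them.
stripe_geometry() {
    level=$1; ndisks=$2; chunk_kb=$3
    data_disks=$ndisks
    if [ "$level" = "10" ]; then data_disks=$((ndisks / 2)); fi
    sunit=$((chunk_kb * 2))   # 1 KiB = two 512-byte sectors
    echo "$sunit $((sunit * data_disks))"
}

# build_array LEVEL MD_DEVICE DISK...
build_array() {
    level=$1; md=$2; shift 2
    chunk_kb=256       # mdadm takes the chunk size in KiB
    readahead_kb=64    # blockdev takes read-ahead in 512-byte sectors
    geometry=$(stripe_geometry "$level" "$#" "$chunk_kb")
    sunit=${geometry% *}
    swidth=${geometry#* }
    run mdadm --create "$md" --level="$level" --chunk="$chunk_kb" \
        --raid-devices="$#" "$@"
    run blockdev --setra $((readahead_kb * 2)) "$md"
    run mkfs.xfs -d "sunit=$sunit,swidth=$swidth" "$md"
}

# Example (prints three commands in dry-run mode):
#   build_array 0 /dev/md0 /dev/sdf /dev/sdg /dev/sdh /dev/sdi
```

Passing 10 instead of 0 as the level produces the RAID10 variant; the only other difference is that the stripe width is computed over half the disks.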
To generate the benchmark numbers above, we used the following IOZone command:
./iozone -Rb ~/output.xls -s 6g -i 0 -i 1 -i 2 -f /testing/device/mountpoint -r 32k
We also used an application-specific benchmark tool, which generated results substantially proportional to the random I/O numbers above.
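To compare several kinds of storage, the same IOZone invocation can be repeated once per mount point. The sketch below only prints the command lines, reusing the flags shown above; the mount points, the report and test file names, and the helper names are hypothetical placeholders.

```sh
#!/bin/sh
# Sketch: print one IOZone invocation per mount point, reusing the flags above.
# The iozone path, mount points, and file names are hypothetical.

IOZONE="${IOZONE:-./iozone}"

# iozone_cmd MOUNTPOINT -> prints the command line for that mount point.
iozone_cmd() {
    mnt=$1
    name=$(echo "$mnt" | tr '/' '_')   # /mnt/raid0 -> _mnt_raid0
    # -R -b FILE : write an Excel-compatible report to FILE
    # -s 6g      : size of the test file
    # -i 0/1/2   : write/rewrite, read/reread, random read/write
    # -r 32k     : record size
    # -f FILE    : temporary test file on the device under test
    echo "$IOZONE -Rb $HOME/iozone$name.xls -s 6g -i 0 -i 1 -i 2 -f $mnt/iozone.tmp -r 32k"
}

# benchmark_all MOUNTPOINT... -> prints one command per mount point.
benchmark_all() {
    for m in "$@"; do
        iozone_cmd "$m"
    done
}

# Example:
#   benchmark_all /mnt/single-ebs /mnt/raid0 /mnt/raid10 | sh
```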
Best Practices
- Use LVM, RAID, and XFS (RAID helps smooth out flaky performance, and XFS + LVM is fast and allows online partition growth)
- Use proper chunk sizes, read ahead, and sunit/swidth (256K chunks and 64K readahead work well for us)
- Make sure your RAID came up on system boot. One of our worst bugs was caused by not noticing a RAID disappeared after a reboot.
- Learn about /proc/diskstats, iostat, iotop, iozone, and bonnie++
- Ideally turn off swap. But, if you must swap, lower the kernel's swappiness setting and put the swap file on an instance store instead of an EBS. Swapping even lightly to an EBS can effectively kill an instance.
- You need to Stop and Start an instance to get it assigned to different hardware
- Use the firewall and/or VPC from the beginning. It's always more painful to set up later.
- Script, monitor, and document everything. Things will fail unexpectedly. Monitoring will help you catch it, and scripts and documentation will help you fix it.
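As one example of the "make sure your RAID came up" advice, a boot-time or cron check can look for each expected array in /proc/mdstat and complain loudly if one is missing. This is a minimal sketch: the array names and function names are hypothetical, and a real setup would hook the failure into whatever monitoring is already in place.

```sh
#!/bin/sh
# Sketch: verify that expected md arrays are active after boot.
# Array names and function names are hypothetical.

# raid_is_active MD_NAME [MDSTAT_FILE]
# Succeeds if /proc/mdstat lists the array as active.
raid_is_active() {
    md_name=$1
    mdstat=${2:-/proc/mdstat}
    grep -q "^$md_name : active" "$mdstat"
}

# check_raids MDSTAT_FILE MD_NAME...
# Reports every missing array on stderr; fails if any is missing.
check_raids() {
    mdstat_file=$1; shift
    missing=0
    for name in "$@"; do
        if ! raid_is_active "$name" "$mdstat_file"; then
            echo "RAID $name is missing or inactive" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   check_raids /proc/mdstat md0 md1 || echo "alert someone"
```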
Benefits
Now that we've gotten some of the hard numbers out of the way, let's discuss some of the broader issues with EC2. The major benefit of AWS is that, in combination with the right architecture, we can trivially provision extra capacity with a one-line Fabric command. Last month, for example, we got unexpected coverage on a popular Chinese blog at 3AM PST, and had to provision new servers to handle the increased load. Our lives were made much easier since we could just type one line into a terminal and go back to sleep!
EBS drives are touted for their outstanding durability across instance life cycles - which our experience with thousands of drives has corroborated (I can count on one hand the number of drives that have lost data on us). Beyond durability, being able to add drives to a machine with a single command adds a lot of flexibility. Sometimes machines run low on disk space - usually because of software improvements that allow a single machine to handle more data, or because of uneven sharding. The traditional solutions to this problem are to over-provision disk capacity up front (which can waste quite a bit of money, and still requires fairly accurate disk usage estimates), to shut down the machine to add more disk capacity, or to move some existing services between machines to rebalance things (which is non-trivial in the general case). But, since we use EBS drives, we can trivially expand partitions online. A simple script will attach additional drives to the machine, automatically build a RAID out of them, add it to the instance's logical volume, and expand the XFS partition that's running low on space - all with absolutely no downtime.
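The steps after the drives are attached can be sketched roughly as follows. This is an illustration rather than our actual script: the volume group, logical volume, mount point, and device names are hypothetical, the 256K chunk is the value from Best Practices, and the new EBS volumes are assumed to be attached already. It prints the commands by default; set DRY_RUN=0 to run them as root.

```sh
#!/bin/sh
# Sketch: grow an XFS filesystem online by adding a new RAID0 to its
# LVM volume group. All names below are hypothetical.

DRY_RUN="${DRY_RUN:-1}"

# Echo a command; also execute it when DRY_RUN=0.
run() {
    echo "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# grow_volume MD_DEVICE VOLUME_GROUP LOGICAL_VOLUME MOUNTPOINT DISK...
grow_volume() {
    md=$1; vg=$2; lv=$3; mnt=$4; shift 4
    # 1. Stripe the freshly attached drives together.
    run mdadm --create "$md" --level=0 --chunk=256 --raid-devices="$#" "$@"
    # 2. Hand the new array to LVM and add it to the volume group.
    run pvcreate "$md"
    run vgextend "$vg" "$md"
    # 3. Give all the new free space to the logical volume.
    run lvextend -l +100%FREE "/dev/$vg/$lv"
    # 4. Grow the mounted XFS filesystem to fill the logical volume.
    run xfs_growfs "$mnt"
}

# Example:
#   grow_volume /dev/md1 vg0 data /mnt/data /dev/sdj /dev/sdk /dev/sdl /dev/sdm
```

XFS can only be grown while mounted, which is what makes the no-downtime path possible: xfs_growfs is pointed at the mount point rather than the device.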
Because of the features and automatability that Amazon allows, we've managed to grow to hundreds of servers without a full-time sysadmin or dev-ops person (although we're looking to hire the right one :-D). This would have been impossible for a team our size if we had to deal with hardware failures, racking servers, configuring switches, etc. Instead, we've built powerful tools that enable us to automate almost all our infrastructure.
Caveats
Of course, all the magic of AWS doesn't come without a cost. The biggest downside we've encountered is the flakiness of EBS performance. On a good day, we might see 5ms disk seek times - but on a bad day we've seen worse than 200ms. The biggest advantage of a large RAID is that it smooths out EBS performance characteristics. We've sometimes observed single EBS disks outperforming a 16 drive RAID0 - but we've also seen single disks slow to the point of being useless. Big RAIDs will be consistently and predictably mediocre.
Second, almost any EBS-related command can fail for no apparent reason. One of the worst offenders is that ~2% of the time you try to attach an EBS disk to an instance, it will just hang (and say 'attaching ...' for hours). The only workaround is to choose a different device name for the drive - no drive will work under that device name for a few hours.
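That workaround is easy to automate: try a device name, give up if the volume never leaves the 'attaching' state, and move on to the next name. The sketch below assumes the AWS command line interface (`aws ec2 ...`) is installed and configured, which is an assumption on top of this post rather than a description of our tooling; the function names, timeouts, and device names are hypothetical.

```sh
#!/bin/sh
# Sketch: attach an EBS volume, falling back to other device names if an
# attach hangs. Assumes the AWS CLI; helper names are hypothetical.

POLL_TRIES="${POLL_TRIES:-30}"   # polls per device name before giving up
POLL_SLEEP="${POLL_SLEEP:-10}"   # seconds between polls

# volume_state VOLUME_ID -> prints the attachment state, e.g. "attached".
volume_state() {
    aws ec2 describe-volumes --volume-ids "$1" \
        --query 'Volumes[0].Attachments[0].State' --output text
}

# attach_with_fallback VOLUME_ID INSTANCE_ID DEVICE...
# Prints the device name that worked; fails if every name hangs.
attach_with_fallback() {
    vol=$1; inst=$2; shift 2
    for dev in "$@"; do
        aws ec2 attach-volume --volume-id "$vol" --instance-id "$inst" \
            --device "$dev" >/dev/null || continue
        tries=0
        while [ "$tries" -lt "$POLL_TRIES" ]; do
            if [ "$(volume_state "$vol")" = "attached" ]; then
                echo "$dev"
                return 0
            fi
            tries=$((tries + 1))
            sleep "$POLL_SLEEP"
        done
        # Still not attached: abandon this device name and free the volume
        # before trying the next one.
        aws ec2 detach-volume --volume-id "$vol" --force >/dev/null
        aws ec2 wait volume-available --volume-ids "$vol"
    done
    return 1
}

# Example:
#   attach_with_fallback vol-12345678 i-12345678 /dev/sdf /dev/sdg /dev/sdh
```

It is worth remembering which device names hung, so that later attach attempts skip them for a few hours.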
A less annoying issue, which you're probably already aware of, is that you will lose individual instances. Your architecture must be able to survive the loss of any one machine without bringing the site down (we haven't been brave enough to try the Chaos Monkey approach - but we've lived through several trials by fire). The unfortunate thing is that the failure behaviour isn't particularly consistent. The AWS console may not show the instance as being dead until an hour after it has actually died. Or there may be a half hour wait to shut down the broken machine or detach its EBS (a stop/start cycle usually gets you new underlying hardware; a simple reboot usually will not).
Finally, we've (rarely) run into capacity issues in a particular availability zone. Either have enough capacity around that you'll survive if you can't provision a particular instance type for a day or so, or be flexible in which availability zone you're willing to bring new instances up in.
We're still pretty new to AWS, so please share any tips, tricks, or bugs you've discovered!