Hadoop and Distributed Computing at Yahoo!
June 11, 2009
Hadoop Test-Related Issues
I'm getting together with some of the Hadoop committers tomorrow. Considering my quality engineering background, these are some of the discussion items at the top of my mind for the project:- Code Review Guidelines: I wrote these up a couple years ago. Are they being followed? Are they the right set? How can we raise the quality of the code reviews being performed before patches are committed?
- Feature Design Documentation: Can we agree that each feature needs a design doc? A proposed template is attached to HADOOP-5587.
- Feature Test Plans: Can we agree that each feature needs a test plan? A proposed template is also attached to HADOOP-5587.
- Warnings: We're working to reduce static analysis (Findbugs), compiler (javac), and documentation (javadoc) warnings to zero. Can we commit to keeping them there?
- Fault Injection Framework: We're working on a fault-inject framework so that my team and others can write tests that inject faults and monitor the effects. The current work is being contributed on HADOOP-5974. What additional requirements might folks have?
- Usability: Web UI and command lines could use some work to be more consistent and user friendly. Can we agree that no stack trace should, by default, be output to the user when using command line?
- Patch Testing: We've saturated the available hardware for test patches. More hardware is on the way. What problems do we have with the current setup (other than speed)? What improvements can we make?
- Fast Commit Builds: We need a quick (10-minute) build and test target in Hadoop. Once committed, how does this new target fit into the contributor/committer workflow?
- Project Split: This is being tracked as HADOOP-4687. How do we manage build and runtime dependencies between all Hadoop projects?
- TestNG vs Junit4: Should we convert to TestNG to take advantage of some of its unique features, such as data provides and test annotations? This is being tracked in HADOOP-4901.
- True Unit Test: So many of our current JUnit tests are really mini-system tests since they are using MiniMRCluster and MiniDFSCluster to bring up a cluster on a single node in a single process. How do we do better to support and monitor contributions with true unit level tests?
- Testing for Backwards Compatibility: There's a strong desire to get to API and configuration backward compatibility from Hadoop 0.21 forward. After Hadoop 0.21, how do we ensure patches are not breaking backwards compatibility?
Quality and Release Engineering
Yahoo! Cloud Computing
June 10, 2009
Announcing the Yahoo! Distribution of Hadoop
Today we're announcing the general availability of the Yahoo! Distribution of Hadoop, a source-only distribution of Apache Hadoop that we deploy here at Yahoo!. In my role as quality and release engineering manager for grid technologies at Yahoo!, including Hadoop, I'm really excited about what this release means for the larger Hadoop ecosystem. Here's why:- We're opening up the results of our investment in quality engineering and scale deployments to the Apache Hadoop community and surrounding ecosystem.
- We're publishing a frequent source distribution that provides a robust foundation on which others can build and deploy their own enterprise distributions, support, and solutions.
- We're committing to keep all of our source code changes for our distributions available as patches in the Apache Hadoop community.
We spend thousands of machine hours to test each release of Hadoop that we deploy internally. We run automated unit, functional, system, and performance tests over a 2-day period on our 500-machine test cluster. This includes interoperability testing of the cross-cluster data-copying tool (distcp), HDFS and MapReduce benchmarks, and various fault scenarios. All of the unit and performance tests are currently available in Apache Hadoop. We are working towards contributing the functional and system tests back to the community. We deploy Hadoop on tens of thousands of machines. These machines are divided into a few tiers, each with many large clusters. In order to support internal feature requests and reliability requirements, we test and deploy frequent bug fix and feature releases to an experimental tier of clusters. Once stabilized sufficiently, these releases progress to additional tiers, eventually landing on a production tier, where Hadoop provides a mission critical platform for many core business units at Yahoo! As a release stabilizes and progresses to new tiers, we inevitably discover, fix, test, and deploy new micro releases quickly. All of this investment in testing and stabilizing Hadoop is now available to anyone. Providing a robust foundation for other distributions, support, and solutions
This distribution is largely a response to the numerous requests that we have received to share Yahoo!'s internally tested and scale-proven releases. As the pace of Hadoop adoption has increased, so have requests for these releases. The Yahoo! Distribution of Hadoop provides a base for others to build their own distributions, commercial support, and solutions. I believe this will broaden the use of Hadoop and speed its development, growth, and quality, by which we will all benefit. To be clear, this is not a new business for Yahoo!. We will not be providing support or services for our distribution, but we hope that by releasing our internally tested version, third parties will build enterprise support and services on top of our distribution. Providing all our patches under the Apache License
The pace of our internal releases and the demand for new features has required a number of features to be internally back-ported. With this release, we're committing to contribute back these internally back-ported features to the community and ensure all code in the Yahoo! Distribution of Hadoop is either in the Apache code repository or posted as patches in the Apache Hadoop community. Hadoop is helping us solve key science and research problems in hours or days instead of months. It provides us a platform to solve extreme problems requiring massive amounts of data processing. It underpins major revenue-generating systems. Opening our distribution enables a faster pace of innovation for the entire Hadoop ecosystem and broadens the use — and ultimately the quality — of this key platform across the industry. Go get it! Nigel Daley
Quality and Release Engineering Manager
Yahoo! Grid Technologies
May 13, 2009
Hadoop computes the 10^15+1st bit of π
I used Yahoo's Hadoop clusters to compute the 1,000,000,000,000,001st bit of π. The 7 hexadecimal digits of π starting at the 10^15+1 bit are:
- 6216B06
Although Hadoop is primarily used for data-intensive applications, it can also be used to run CPU-intensive jobs on many machines. Computing a range of bytes in π using a BPP-type formula, requires a lot of arithmetic operations and therefore CPU, but not much storage. When computing the 10^15+1st bit of π, the first 30% of the computation was done in idle slots of our Hadoop clusters spread over 20 days. The remaining 70% was finished over a weekend on the Hammer cluster, which was also used for the petabyte sort benchmark.
This validates the results calculated by PiHex, which took more than 2 years on 1734 computers from 56 different countries.
My program was written entirely in Java and ran on Hadoop 0.20. An earlier version is checked in as a Hadoop example named BaileyBorwinPlouffe. The new code will be uploaded soon.
-- Tsz Wo (Nicholas), Sze
May 11, 2009
Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds
We used Apache Hadoop to compete in Jim Gray's Sort benchmark. Jim's Gray's sort benchmark consists of a set of many related benchmarks, each with their own rules. All of the sort benchmarks measure the time to sort different numbers of 100 byte records. The first 10 bytes of each record is the key and the rest is the value. The minute sort must finish end to end in less than a minute. The Gray sort must sort more than 100 terabytes and must run for at least an hour. The best times we observed were:
| Bytes | Nodes | Maps | Reduces | Replication | Time |
|---|---|---|---|---|---|
| 500,000,000,000 | 1406 | 8000 | 2600 | 1 | 59 seconds |
| 1,000,000,000,000 | 1460 | 8000 | 2700 | 1 | 62 seconds |
| 100,000,000,000,000 | 3452 | 190,000 | 10,000 | 2 | 173 minutes |
| 1,000,000,000,000,000 | 3658 | 80,000 | 20,000 | 2 | 975 minutes |
Within the rules for the 2009 Gray sort, our 500 GB sort set a new record for the minute sort and the 100 TB sort set a new record of 0.578 TB/minute. The 1 PB sort ran after the 2009 deadline, but improves the speed to 1.03 TB/minute. The 62 second terabyte sort would have set a new record, but the terabyte benchmark that we won last year has been retired. (Clearly the minute sort and terabyte sort are rapidly converging, and thus it is not a loss.) One piece of trivia is that only the petabyte dataset had any duplicate keys (40 of them).
We ran our benchmarks on Yahoo's Hammer cluster. Hammer's hardware is very similar to the hardware that we used in last year's terabyte sort. The hardware and operating system details are:
- approximately 3800 nodes (in such a large cluster, nodes are always down)
- 2 quad core Xeons @ 2.5ghz per node
- 4 SATA disks per node
- 8G RAM per node (upgraded to 16GB before the petabyte sort)
- 1 gigabit ethernet on each node
- 40 nodes per rack
- 8 gigabit ethernet uplinks from each rack to the core
- Red Hat Enterprise Linux Server Release 5.1 (kernel 2.6.18)
- Sun Java JDK (1.6.0_05-b13 and 1.6.0_13-b03) (32 and 64 bit)
We hit a JVM bug that caused a core dump in 1.6.0_05-b13 on the larger sorts (100TB and 1PB) and switched over to the later JVM, which resolved the issue. For the larger sorts, we used 64 bit JVMs for the Name Node and Job Tracker.
Because the smaller sorts needed lower latency and faster network, we only used part of the cluster for those runs. In particular, instead of our normal 5:1 over subscription between racks, we limited it to 16 nodes in each rack for a 2:1 over subscription. The smaller runs can also use output replication of 1, because they only take minutes to run and run on smaller clusters, the likelihood of a node failing is fairly low. On the larger runs, failure is expected and thus replication of 2 is required. HDFS protects against data loss during rack failure by writing the second replica on a different rack and thus writing the second replica is relatively slow.
Below are the timelines for the jobs counting from the job submission at the Job Tracker. The diagrams show the number of tasks running at each point in time. While maps only have a single phase, the reduces have three: shuffle, merge, and reduce. The shuffle is the transfer of the data from the maps. Merge doesn't happen in these benchmarks, because none of the reduces need multiple levels of merges. Finally, the reduce phase is where the final merge and writing to HDFS happens. I've also included a category named waste that represents task attempts that were running, but ended up either failing, or being killed (often as speculatively executed task attempts).



If you compare this years charts to last year's, you'll notice that tasks are launching much faster now. Last year we only launched one task per heartbeat, so it took 40 seconds to get all of the tasks launched. Now, Hadoop will fill up a Task Tracker in a single heartbeat. Reducing that job launch overhead is very important for getting runs under a minute.
As with last year, we ran with significantly larger tasks than the defaults for Hadoop. Even with the new more aggressive shuffle, minimizing the number of transfers (maps * reduces) is very important to the performance of the job. Notice that in the petabyte sort, each map is processing 15 GB instead of the default 128 MB and each reduce is handling 50 GB. When we ran the petabyte with more typical values 1.5 GB / map, it took 40 hours to finish. Therefore, to increase throughput, it makes sense to consider increasing the default block size, which translates into the default map size, to at least up to 1 GB.
We used a branch of trunk with some modifications that will be pushed back into trunk. The primary ones are that we reimplemented shuffle to re-use connections, and we reduced latencies and made timeouts configurable. More details including the changes we made to Hadoop are available in our report on the results.
-- Owen O'Malley and Arun Murthy
May 5, 2009
Hadoop Summit 2009 - Open for registration
This year’s Hadoop Summit is confirmed for June 10th at the Santa Clara Marriott, and is now open for registration.
We have a packed agenda, with three tracks – for developers, administrators, and one focused on new and innovative applications using Hadoop. The presentations include talks from Amazon, IBM, Sun, Cloudera, Facebook, HP, Microsoft, and the Yahoo! team, as well as leading universities including UC Berkeley, CMU, Cornell, U of Maryland, U of Nebraska and SUNY.
From our experience last year with the rush for seats, we would encourage people to register early.
May 4, 2009
Using ZooKeeper to tame system test for large-scale services
In the last release I took on the task of setting up a true system test environment for Apache ZooKeeper. Our previous environment ran the system test in a single JVM instance, which meant that there were some test scenarios that we just couldn't reproduce. In this new environment we wanted to be able to run tests across multiple hosts and deal with different numbers of machines and cluster environments.
My first attempt used ssh and scripts to fire off servers and clients on a cluster of machines. I soon realized that there was a significant amount of hardcoded configuration and envirnment
assumptions that would make the setup inflexible and impractical. So, I started over from scratch.
For the system test I wanted to use some set of machines, M, to host some number of ZooKeeper servers, S, and some number of clients C. Ideally the system test could just discover M at runtime, startup S servers on some subset of M and then startup C clients on the rest. I realized this is a perfect application for ZooKeeper. I also realized that such a use case can apply to more than just system tests.
What I needed to do was very similar to the rendezvous problem: you have two processes that startup with no knowledge of where each other is running and they need to find each other. Common solutions to this involve service discovery by broadcasting on well known ports, but for our network topologies broadcast based solutions will not work. But, ZooKeeper makes rendezvous easy. Each process creates an ephemeral znode with contact information as a child of a previously agreed upon znode (the rendezvous point). For example, let's use "/systest/available" as the rendezvous point. If four hosts startup, /systest/available would have the following children:
/systest/available/host1
/systest/available/host2
/systest/available/host3
/systest/available/host4
I can now see what machines are available for use in our system test.
Now that I know the machines that I can use, I want to start assigning work to them. The Java process that created /systest/available/host1 also created /systest/assignments/host1 and is watching that znode for children to appear. So when I need to assign task1 to a machine and assuming host1 is the least loaded machine, the system test will create the znode /systest/assignments/host1/task1, which will contain information about what class to instantiation and start as well as configuration parameters. The creation of /systest/assignments/host1/task1 will trigger a watch on the agent process running on that machine. The agent running on that machine will read the information about the
new task and start it. I also use the task znode to stop tasks, change their configuration, and get task status.
The end result is a very clean system test environment. The test classes themselves just interact through ZooKeeper, so there aren't any hardcoded assumptions about the environment needed. This also serves as a nice illustration of how Apache ZooKeeper can be used for more than just leader election and locking. Of course the next step would be to combine this scheme with something like OSGi to provide automatic management of the class dependences of the instances.
See the Apache ZooKeeper website to give it a try. The system test describe here is in the latest release: 3.1.1

