Yahoo! Developer Network Blog

June 10, 2009

Announcing the Yahoo! Distribution of Hadoop

Today we're announcing the general availability of the Yahoo! Distribution of Hadoop, a source-only distribution of Apache Hadoop that we deploy here at Yahoo!.

In my role as quality and release engineering manager for grid technologies at Yahoo!, including Hadoop, I'm really excited about what this release means for the larger Hadoop ecosystem. Here's why:

  1. We're opening up the results of our investment in quality engineering and scale deployments to the Apache Hadoop community and surrounding ecosystem.
  2. We're publishing a frequently updated source distribution that provides a robust foundation on which others can build and deploy their own enterprise distributions, support, and solutions.
  3. We're committing to keep all of our source code changes for our distributions available as patches in the Apache Hadoop community.

Opening our investment in quality engineering and scale deployments
We spend thousands of machine hours to test each release of Hadoop that we deploy internally. We run automated unit, functional, system, and performance tests over a 2-day period on our 500-machine test cluster. This includes interoperability testing of the cross-cluster data-copying tool (distcp), HDFS and MapReduce benchmarks, and various fault scenarios. All of the unit and performance tests are currently available in Apache Hadoop. We are working towards contributing the functional and system tests back to the community.
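As a rough sanity check on the "thousands of machine hours" figure, the numbers above can be multiplied out. This is only a back-of-the-envelope sketch: it assumes every machine in the test cluster is busy for the entire two-day window, which the post does not state, so the result is an upper bound rather than a measured figure. The function name is made up for illustration.

```python
# Back-of-the-envelope estimate of test effort per release.
# Assumption (not from the post): all machines are busy for the
# whole test window, so this is an upper bound.

MACHINES = 500        # size of the test cluster, from the post
TEST_DAYS = 2         # length of the automated test run, from the post
HOURS_PER_DAY = 24


def machine_hours_upper_bound(machines: int, days: int) -> int:
    """Return the maximum machine-hours a test run could consume."""
    return machines * days * HOURS_PER_DAY


if __name__ == "__main__":
    print(machine_hours_upper_bound(MACHINES, TEST_DAYS))  # prints 24000
```

Even if actual utilization is well below this ceiling, the total stays comfortably in the thousands of machine hours per release.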

We deploy Hadoop on tens of thousands of machines. These machines are divided into a few tiers, each with many large clusters. To support internal feature requests and reliability requirements, we test and deploy frequent bug-fix and feature releases to an experimental tier of clusters. Once sufficiently stabilized, these releases progress to additional tiers, eventually landing on a production tier, where Hadoop provides a mission-critical platform for many core business units at Yahoo!. As a release stabilizes and progresses to new tiers, we inevitably discover new issues, which we quickly fix, test, and deploy as new micro releases.
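The tiered rollout described above can be pictured with a small model. This is an illustrative sketch only, not Yahoo!'s actual release tooling: the post names just an experimental tier and a production tier, so the middle tier name, the `Release` class, its methods, and the example version string are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical tier names. The post only names "experimental" and
# "production"; "staging" stands in for the additional tiers.
TIERS = ["experimental", "staging", "production"]


@dataclass
class Release:
    version: str          # e.g. "1.2" (made-up version string)
    micro: int = 0        # micro release counter
    tier_index: int = 0   # every release starts on the experimental tier

    @property
    def tier(self) -> str:
        return TIERS[self.tier_index]

    @property
    def label(self) -> str:
        return f"{self.version}.{self.micro}"

    def fix_issue(self) -> None:
        """An issue found on the current tier yields a new micro release."""
        self.micro += 1

    def promote(self, stable: bool) -> bool:
        """Advance one tier if the release is stable and not yet in production."""
        if stable and self.tier_index < len(TIERS) - 1:
            self.tier_index += 1
            return True
        return False


if __name__ == "__main__":
    release = Release("1.2")
    release.fix_issue()            # bug found on the experimental tier
    release.promote(stable=True)   # stabilized, so it moves up one tier
    print(release.label, release.tier)
```

The point of the model is that fixes produce micro releases in place, while promotion to the next tier happens only once a release is judged stable.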

All of this investment in testing and stabilizing Hadoop is now available to anyone.

Providing a robust foundation for other distributions, support, and solutions
This distribution is largely a response to the numerous requests that we have received to share Yahoo!'s internally tested and scale-proven releases. As the pace of Hadoop adoption has increased, so have requests for these releases. The Yahoo! Distribution of Hadoop provides a base for others to build their own distributions, commercial support, and solutions. I believe this will broaden the use of Hadoop and speed its development, growth, and quality, from which we will all benefit. To be clear, this is not a new business for Yahoo!. We will not be providing support or services for our distribution, but we hope that by releasing our internally tested version, third parties will build enterprise support and services on top of our distribution.

Providing all our patches under the Apache License
The pace of our internal releases and the demand for new features have required a number of features to be internally back-ported. With this release, we're committing to contribute these internally back-ported features back to the community and to ensure all code in the Yahoo! Distribution of Hadoop is either in the Apache code repository or posted as patches in the Apache Hadoop community.

Hadoop is helping us solve key science and research problems in hours or days instead of months. It provides us a platform to solve extreme problems requiring massive amounts of data processing. It underpins major revenue-generating systems. Opening our distribution enables a faster pace of innovation for the entire Hadoop ecosystem and broadens the use, and ultimately improves the quality, of this key platform across the industry.

Go get it!

Nigel Daley
Quality and Release Engineering Manager
Yahoo! Grid Technologies

Posted at June 10, 2009 9:30 AM

Comments

Fantastic news- having access to this level of tested/production code will give us much more confidence in moving more of our critical production infrastructure to Hadoop. Way to go guys!

Posted by: Lance Riedel at June 10, 2009 9:57 AM | Permalink

Awesome! Great news. Keep up the good work!

Posted by: Hamlet Khodaverdian at June 10, 2009 8:14 PM | Permalink

That's good news! It would be nice if you could highlight the advantages of this version compared to the one given by Apache.

Posted by: Ritesh M Nayak at June 10, 2009 10:26 PM | Permalink

Why fork?

Why not just contribute back to Apache?

Posted by: Aaron at June 14, 2009 8:47 PM | Permalink

This release is good news, thanks! Does your testing include Pig?

Posted by: David Fallside at June 23, 2009 11:41 AM | Permalink

Hadoop will be a great acquisition for yahoo

Posted by: Emagrecer at July 11, 2009 7:05 AM | Permalink

oh nice topic.

Posted by: uggs at October 15, 2009 2:25 AM | Permalink

Is it different from the Cloudera or Apache distributions of Hadoop? Why didn't you simply make changes to Apache Hadoop?

Posted by: shahryar ghazi at December 4, 2009 6:35 AM | Permalink

shahryar,

All the code in the Y! Distro is in Apache Hadoop mainline or patches on Apache Hadoop Jira. You could think of the Y! Distro as a preview of what's to come in later releases of Apache Hadoop. Plus it's been tested at large scale.

Posted by: Nigel at December 9, 2009 11:00 PM | Permalink

Hadoop is a trademark of the Apache Software Foundation.

Copyright © 2010 Yahoo! Inc. All rights reserved.
