Yahoo! Developer Network Blog
« Previous | Main | Next »
July 25, 2007
Open Source Distributed Computing: Yahoo's Hadoop Support
For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges. While nearly everyone agrees that the "divide-and-conquer using lots of cheap hardware" approach to breaking down large problems is the only way to scale, doing so is not easy.
The underlying infrastructure has always been a challenge. You have to buy, power, install, and manage a lot of servers. Even if you use somebody else's commodity hardware, you still have to develop the software that'll do the divide-and-conquer work to keep them all busy.
It's hard work. And it needs to be commoditized, just like the hardware has been...
We too have been dealing with this at Yahoo. Analyzing petabytes of data takes a lot of CPU power and storage. And given the way our needs (and the web as a whole) have been growing, there will likely be dozens of similarly demanding applications before long.
To build the necessary software infrastructure, we could have gone off to develop our own technology, treating it as a competitive advantage, and charged ahead. But we've taken a slightly different approach. Realizing that a growing number of companies and organizations are likely to need similar capabilities, we got behind the work of Doug Cutting (creator of the open source Nutch and Lucene projects) and asked him to join Yahoo to help deploy and continue working on the [then new] open source Hadoop project.
What started here as a 20 node cluster in March of 2006 was up to nearly 200 a month later and has continued to grow as it eats terabytes and terabytes of data. It wasn't long after that our code contributions back to Hadoop really started to ramp up as well.
Here's a quick timeline of how things have progressed since then...
- 2004 - Initial versions of what is now Hadoop Distributed File System and Map-Reduce implemented by Doug Cutting & Mike Cafarella
- December 2005 - Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
- January 2006 - Doug Cutting joins Yahoo!
- February 2006 - Apache Hadoop project official started to support the standalone development of Map-Reduce and HDFS.
- March 2006 - Formation of the Yahoo! Hadoop team
- May 2006 - Yahoo sets up a Hadoop research cluster - 300 nodes
- April 2006 - Sort benchmark run on 188 nodes in 47.9 hours
- May 2006 - Sort benchmark run on 500 nodes in 42 hours (better hardware than April benchmark)
- October 2006 - Research cluster reaches 600 Nodes
- December 2006 - Sort times 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8
- January 2006 - Research cluster reaches 900 node
- April 2007 - Research clusters - 2 clusters of 1000 nodes
By supporting and contributing to an open source grid computing project, we hope to be part of providing a solid, efficient, and scalable system that anyone can use to attack the types of problems and data sets that are becoming more common on the web. And since it's open source, everyone benefits from the expertise of developers and users around the world. We've already seen similar benefits from our use and support of Apache, PHP, and MySQL (just to name a few).
As we noted last week, Doug and Eric Baldeschwieler (Yahoo's Director of Grid Computing) are presenting Meet Hadoop at the 2007 Open Source Convention this week. While this is one of the first times we're really talking about our involvement with Hadoop in public, it certainly won't be the last.
Looking ahead and thinking about how the economics of large scale computing continue to improve, it's not hard to imagine a time when Hadoop and Hadoop-powered infrastructure is as common as the LAMP (Linux, Apache, MySQL, Perl/PHP/Python) stack that helped to powered the previous growth of the Web. We're already seeing universities begin to teach about Hadoop (University of Washington) and looking at building their own clusters (Carnegie Mellon University).
We're still in the very early days of this revolution and very proud to be part of it.
Jeremy Zawodny
Yahoo! Developer Network
Posted at July 25, 2007 10:30 AM | Permalink
Comments
The link to the Open Source Conference is broken -- I think it should be http://conferences.oreillynet.com/os2007/
Posted by: Chad at July 25, 2007 11:07 AM
Fixed. Thanks, Chad.
Posted by: Jeremy Zawodny at July 25, 2007 11:51 AM
The current sort times are:
20 nodes: 1.2 hours
100 nodes: 1.33 hours
500 nodes: 1.97 hours
900 nodes: 2.5 hours
As you can see, we have made a lot of progress in the last 7 months.
Posted by: Owen O'Malley at July 25, 2007 1:00 PM
Why is it that the the elapsed run time is greater with more nodes? I assume the inter-node communication overhead is greater than the actual workload when high numbers of nodes are invovled.
Posted by: Alex at July 25, 2007 10:29 PM
@Alex : The sort input size (randomized input to be sorted) increases with the number of nodes so that the average size of data sorted per node is similar for say a 20 and 900 node cluster.
Posted by: Gautam at July 26, 2007 6:24 PM
@Alex: Elapsed time being greater is primarily due to the fact that we essentially use the same network (backplane, bandwidth etc.) to do linearly scaling i/o as we scale to hundreds of nodes... *smile*
Posted by: Arun C Murthy at July 26, 2007 10:56 PM
Probably the wrong place to say this, but I don't know where else. It's impossible to sign up for a yahoo maps api key. The page doesn't even load -- "Ubable to Connect".
I go here, http://api.search.yahoo.com/webservices/register_application and get forwarded to: https://developer.yahoo.com/wsregapp/index.php which does not work.
Posted by: Tim at July 27, 2007 12:43 AM
Congrats yahoo! for this achievement. I wish Greek Universities will teach that kind of technologies as well.
Posted by: JohnG at July 27, 2007 6:19 AM
I don't understand how this can be handle. they good way we or i can be fully able to participate is to keep me informed whenever there is a news about the new technoly.
Posted by: Diing Akol at August 15, 2007 11:41 PM
Appreciate the philanthropic act - you too make the world a better place.
(A quick typo noticed - the timeline year published (last but one) should be January 2007 not 2006 :P just my 2 cents)
Posted by: Somasundram at November 16, 2007 8:14 PM
Post a comment
Comment Policy: We encourage comments and look forward to hearing from you. Please note that Yahoo! may, in our sole discretion, remove comments if they are off topic, inappropriate, or otherwise violate our Terms of Service. Fields marked with asterisk '*' are required.
Subscribe
Recent Blog Articles
view all
YQL Open Table for Google Buzz now live
Tue, 09 Feb 2010
INSERT INTO twitter.status ...
Mon, 08 Feb 2010
Announcing the Yahoo! Brasil Open Hack Day 2010, 20-21 March
Mon, 08 Feb 2010
Marketing hacks, linchpins, and tech women of valor
Sun, 07 Feb 2010
Yahoo! India invites you to join the first India Hadoop Summit
Thu, 04 Feb 2010
Recent Links
Appcelerator Titanium + Yahoo YQL on Vimeo
Mon, 08 Feb 2010
Tue, 02 Feb 2010
PhoneGap | Cross platform mobile framework
Sat, 30 Jan 2010
Web developers can rule the iPad - O'Reilly Radar
Sat, 30 Jan 2010
rc3.org - Is the iPad the harbinger of doom for personal computing?
Thu, 28 Jan 2010
Archives
2010
2009
2008
2007
2006
2005
Recent Readers

