Yahoo! Developer Network Blog
« Previous | Main | Next »
February 19, 2008
Yahoo! Launches World's Largest Hadoop Production Application
Yahoo! recently launched what we believe is the worlds largest Apache Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a more than 10,000 core Linux cluster and produces data that is now used in every Yahoo! Web search query.
The Webmap build starts with every Web page crawled by Yahoo! and produces a database of all known Web pages and sites on the internet and a vast array of data about every page and site. This derived data feeds the Machine Learned Ranking algorithms at the heart of Yahoo! Search.
Some Webmap size data:
- Number of links between pages in the index: roughly 1 trillion links
- Size of output: over 300 TB, compressed!
- Number of cores used to run a single Map-Reduce job: over 10,000
- Raw disk used in the production cluster: over 5 Petabytes
This process is not new (see the AltaVista connectivity server). What is new is the use of Hadoop. Hadoop has allowed us to run the identical processing we ran pre-Hadoop on the same cluster in 66% of the time our previous system took. It does that while simplifying administration. Further we believe that as we continue to scale up Hadoop, we will be able to scale up our production jobs as needed to larger cluster sizes.
Our team is very excited about the deployment of the Yahoo! Webmap on Hadoop because it demonstrates that although Hadoop is still at a very early stage in its development (perhaps even immature), Hadoop is now capable of handling truly Internet scale projects in a cost effective manner. This and a number of other production system deployments in Yahoo! and other organizations demonstrate that Hadoop is gaining traction in the market and adding real value.
The Yahoo! Grid team has been enhancing and using Hadoop for various research and development tasks since march 2006. We are proud of our role in taking Hadoop from a system that worked on dozens of computers two years ago, to a system that runs on thousands of computers today. The Webmap launch demonstrates the power of Hadoop to solve truly Internet-sized problems and to function reliably in a large scale production setting. We can now say that the results generated by the billions of Web search queries run at Yahoo! every month depend to a large degree on data produced by Hadoop clusters.
For more details about Yahoo!s Webmap project and the work that has gone into scaling Hadoop to support it, see an interview with two long-time colleges of mine, Arnab Bhattacharjee (manager of the Yahoo! Webmap Team) and Sameer Paranjpye (manager of our Hadoop development), embedded above.
Eric Baldeschwieler
Senior Director, Grid Computing
Yahoo! Inc.
Posted at February 19, 2008 7:13 AM
Comments
is "66% faster" normalized with respect to the number of cores used? (that is, were you also previously using 10k cores?). Another interesting figure would be the average core utilization over the cluster over the job. Would you know that one? :-)
Giovanni
Posted by: Giovanni Tummarello at February 19, 2008 11:21 AM | Permalink
66% of the time != 66% faster
Posted by: at February 19, 2008 11:24 AM | Permalink
Wow, the same 10,000 machines? what happened to Yahoo during the switch over?
Posted by: Tim Wintle at February 19, 2008 1:03 PM | Permalink
Awesome! This has got to be sweet.. Congratulations! :)
Posted by: Naveen Koorakula at February 19, 2008 1:16 PM | Permalink
This does run on the same machines. We have a couple of banks to allow us to roll on new software, so the transition was seamless.
(And thanks naveen, it is a fun milestone)
Posted by: Eric Baldeschwieler at February 19, 2008 4:01 PM | Permalink
Congratulations to the Yahoo search and grid computing teams and to the Hadoop community on this fantastic accomplishment!
Posted by: Chad Walters at February 19, 2008 5:33 PM | Permalink
Congratulations!
One question, what's the main difference of hadoop between adopted by yahoo and open source.
Posted by: Mafish at February 19, 2008 5:59 PM | Permalink
Good question Mafish, I would also like to know when/if all internal changes will be released to the publicly available version of Hadoop.
Posted by: Thijs at February 19, 2008 6:26 PM | Permalink
We use the same version of Hadoop internally that's available from the public code repository. We don't have an "internal version" of Hadoop at Yahoo.
Posted by: Jeremy Zawodny at February 19, 2008 6:31 PM | Permalink
Cool man this is going to beat Google soon.
Yahoo is all way the best because its first
Posted by: ARUN REDDY. M at February 19, 2008 7:08 PM | Permalink
Congratulations to the Yahoo Hadoop team on the very impressive Hadoop rollout.
I wonder what OS & Java version Hadoop nodes run on at Yahoo. Is it FreeBSD?
Thanks.
Posted by: Trung Nguyen at February 19, 2008 8:38 PM | Permalink
This deployment was on Linux. We're using Java 6 which has some nice scaling features for servers.
Posted by: Sameer Paranjpye at February 19, 2008 11:40 PM | Permalink
This deployment was on Linux. We're using Java 6 which has some nice scaling features for nio Selectors etc.
Posted by: Sameer Paranjpye at February 19, 2008 11:47 PM | Permalink
Can you look at getting a flash player with better buffering? Requiring a constant 500kb/s is a bit hard for some people's Internet. Usually I hit "play" on something and then "Pause" until it has buffered well ahead but you client doesn't appear to support this so I am getting a very blippy experience. Sorry :(
Posted by: Simon at February 20, 2008 2:41 AM | Permalink
Congratulations for the milestone!
But it's unbelivable you haven't made any changes to hadoop, because many parts of it are unmature.
Posted by: Desertfox at February 20, 2008 10:46 PM | Permalink
If a researcher is interested on running some Web algorithm on, say, an 80 core cluster (scale of a university lab cluster), would a 4 month for a summer intern be sufficient to get it configured, tuned, and working?
Posted by: Mike at February 22, 2008 8:14 PM | Permalink
Mike - it really shouldn't much time at all... please take a look at the documentation at
http://hadoop.apache.org/core/docs/r0.16.0/ and shout out at core-user@hadoop.apache.org if you need any clarifications.
Posted by: Arun C Murthy at February 25, 2008 9:53 AM | Permalink
Mike - it really should be quite easy to setup. Please take a look at the documentation at http://hadoop.apache.org/core/docs/r0.16.0/ and shout out at core-user@hadoop.apache.org if you need any clarifications/help.
Posted by: Arun C Murthy at February 25, 2008 9:54 AM | Permalink
Mike- We built a 64-node cluster at our university in very quickly and had created some sizable programs within three months. Definitely do-able. I've documented some of it on my blog.
Posted by: Jakob Homan at February 28, 2008 2:39 PM | Permalink
Post a comment
Comment Policy: We encourage comments and look forward to hearing from you. Please note that Yahoo! may, in our sole discretion, remove comments if they are off topic, inappropriate, or otherwise violate our Terms of Service.
Hadoop is a trademark of the Apache Software Foundation.
Subscribe
Recent Blog Articles
view all
Hadoop Bay Area User Group - Feb 17th at Yahoo!, Sunnyvale
Wed, 03 Feb 2010
Comparing Pig Latin and SQL for Constructing Data Processing Pipelines
Fri, 29 Jan 2010
Video from Jan. 20, 2010 Hadoop Bay Area User Group now online
Thu, 28 Jan 2010
Stomping out Java "concurrency cockroaches" with SureLogic's Flashlight and JSure tools
Tue, 26 Jan 2010
Hadoop Bay Area January 2010 User Group - Recap
Thu, 21 Jan 2010
Recent Links
Appcelerator Titanium + Yahoo YQL on Vimeo
Mon, 08 Feb 2010
Tue, 02 Feb 2010
PhoneGap | Cross platform mobile framework
Sat, 30 Jan 2010
Web developers can rule the iPad - O'Reilly Radar
Sat, 30 Jan 2010
rc3.org - Is the iPad the harbinger of doom for personal computing?
Thu, 28 Jan 2010
Archives
Recent Readers

