Hadoop and Distributed Computing at Yahoo!

October 28, 2009

Slides from Hadoop World and University Talks

Here are the slides from my recent talks at Hadoop World 2009, New York, and at UIUC and Carnegie Mellon the following week. These outline how Yahoo! uses Hadoop. The university talk includes learnings from building and deploying Hadoop at Yahoo!.

Hadoop World:

University Talk: (Audio from UIUC)

eric14 a.k.a. Eric Baldeschwieler
VP Hadoop Software Development @ Yahoo!

October 23, 2009

Hadoop User Group (HUG) – Oct 21st at Yahoo!

Thanks, everyone, for joining us last Wednesday night for the Hadoop user group. There were close to a hundred attendees! I’m excited to see that Hadoop and its related technologies create such a buzz in The Valley. We had a full agenda, thanks to the presenters and their courage to do a live demo :).

Hong Tang from Yahoo! walked us through Mumak, a simulator for Large-scale Distributed System Verification & Debugging.

Philip Zeyliger from Cloudera presented the cool beta release of Cloudera Desktop.

And last, but certainly not least, Shevek from Karmasphere gave an energetic review on the Karmasphere Studio™ for Hadoop.

Hope to see you all, and many new faces, at the next HUG on Nov 18th (with cold beers this time...).

As always, we are looking for exciting technologies/ideas and experiences you want to share.

Please email presentation requests to dekel at yahoo hyphen inc dot com

Cheers,

Dekel

M45 Enables Web-Scale Information Extraction Research

About us
We are PhD students at Carnegie Mellon in the Machine Learning Department and the Language Technologies Institute, and our thesis work is part of the Read the Web project, which is led by Professor Tom Mitchell. The goal of our project is to build a system that can start from a limited amount of knowledge and gradually learn to understand language by utilizing millions of web pages. The project serves as a great case study for developing new Machine Learning and Natural Language Processing techniques. We use the Yahoo! M45 Cluster with Hadoop extensively in our work, and this post gives an example of how we use M45 and why it is essential to our research project.

Fact extraction using statistics on patterns and instances
Our algorithms learn to extract facts from the web by using statistics from a large collection of web pages. They start from initial input that consists of categories (e.g., Company, City, Sports Team) and relations (e.g., CompanyHeadquarteredInCity(Company, City), SportsTeamHomeCity(SportsTeam, City)) of interest, along with 15 "seed" instances of each category and relation (e.g., Yahoo! and IBM are instances of the Company category, and the pair (White Sox, Chicago) is an instance of the SportsTeamHomeCity relation). Our system learns new facts by discovering contextual patterns like "treaty with X" (which is good evidence that X is a country) and "X corporate headquarters in Y". The system learns patterns like these by observing that, for example, a seed pair like (Yahoo!, Sunnyvale) fills the X and Y slots in the pattern. It also uses learned patterns to discover new pairs of noun phrases that might satisfy the same relation. Because patterns are rarely perfect extractors of facts, it then ranks candidate patterns and instances using statistics about how often they co-occur with each other. For example, "treaty with Lincoln" occurs 5 times on the web, and "treaty with Japan" occurs about 65,000 times, so mistakes happen, but for useful patterns the mistakes have smaller counts. (However, "treaty with aliens" occurs 168,000 times... Maybe this wasn't an ideal example... :) )
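
To make the idea concrete, here is a minimal sketch of pattern matching and co-occurrence counting. It is not the project's actual code: the toy sentences, the function names, and the simplification that each slot matches a single whitespace-free token are all assumptions made for illustration.

```python
import re
from collections import Counter

# Made-up toy corpus built around the examples in the text; the real
# system works from a very large collection of web sentences.
SENTENCES = [
    "Yahoo! corporate headquarters in Sunnyvale",
    "IBM corporate headquarters in Armonk",
    "a treaty with Japan was signed",
    "a treaty with Japan was ratified",
    "a treaty with Lincoln was discussed",
]

def pattern_to_regex(pattern):
    """Turn a pattern such as 'X corporate headquarters in Y' into a regex.

    Assumption: each slot (X or Y) matches exactly one token, whereas real
    noun phrases can span several words.
    """
    parts = []
    for token in pattern.split():
        if token in ("X", "Y"):
            parts.append(r"(?P<%s>\S+)" % token)
        else:
            parts.append(re.escape(token))
    return re.compile(r"\s+".join(parts))

def count_cooccurrences(patterns, sentences):
    """Count how often each (pattern, slot fillers) combination occurs."""
    counts = Counter()
    for pattern in patterns:
        regex = pattern_to_regex(pattern)
        slot_names = sorted(regex.groupindex)  # e.g. ['X'] or ['X', 'Y']
        for sentence in sentences:
            for match in regex.finditer(sentence):
                fillers = tuple(match.group(name) for name in slot_names)
                counts[(pattern, fillers)] += 1
    return counts

counts = count_cooccurrences(
    ["treaty with X", "X corporate headquarters in Y"], SENTENCES
)
for (pattern, fillers), n in counts.most_common():
    print(pattern, fillers, n)
```

In this toy corpus "treaty with Japan" outnumbers "treaty with Lincoln", which is the kind of count difference the ranking step relies on.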

Before M45: Hit counts
At a small scale, it is possible to run such algorithms using search APIs like Yahoo!'s BOSS service. However, to get accurate statistics of how many times each candidate pattern occurs with each candidate pair of noun phrases, a program would need to get the hit count of the string that results from plugging the noun phrases into the pattern. If we implemented this using individual hit count queries, then for 10,000 patterns and every pairing of 10,000 noun phrases, 10,000^3 hit count queries would be necessary to find the number of times each pair of noun phrases occurs with each pattern. At one query per second, this would take over 30,000 years. In our work, this ruled out using search APIs to gather the necessary statistics from the web.
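
As a rough check on the estimate above, the arithmetic can be written out in a few lines of Python. The query count and the one-query-per-second rate come from the text; the 365.25-day year and the function name are assumptions made for this sketch.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumed average year length

def years_of_queries(num_queries, queries_per_second=1.0):
    """Return how many years it takes to issue num_queries sequentially."""
    return num_queries / queries_per_second / SECONDS_PER_YEAR

total_queries = 10_000 ** 3  # query count quoted in the text
estimated_years = years_of_queries(total_queries)
print(f"{total_queries:,} queries take about {estimated_years:,.0f} years")
```

The result lands a little above 30,000 years, in line with the figure quoted above.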

Enter M45, stage West
This is where the M45 computing cluster and Hadoop have been invaluable for our group's research. Our first solution worked like this: we processed a large web corpus (crawled by Jamie Callan and his group here at CMU) to extract all of the sentences in the web pages. This yielded a corpus of approximately 500 million unique sentences. We then wrote subroutines which submitted jobs to M45 to answer three types of queries: a) Given a list of patterns, what noun phrases fill in the blanks of those patterns? b) Given a list of noun phrases, what patterns do those noun phrases occur with? And, c) Given a list of patterns and noun phrases, how many times does each pattern co-occur with each noun phrase (or pair of noun phrases)? After some effort towards implementing these subroutines efficiently (e.g., using Trie data structures to efficiently decide if a given sentence contained any strings of interest), our routines were able to scan through our entire corpus in about 10 minutes using only 30 nodes on M45.
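
The post does not show how the trie-based matching was implemented, so the following is only a sketch of the general technique: a token-level trie that reports which phrases of interest occur in a sentence in a single left-to-right pass per start position. The class name, method names, and whitespace tokenization are assumptions; in a Hadoop job, something like `find_all` would be called on each sentence inside the map step.

```python
class PhraseTrie:
    """Token-level trie holding the phrases we want to find in sentences."""

    _END = object()  # marker key: a stored phrase ends at this node

    def __init__(self, phrases=()):
        self.root = {}
        for phrase in phrases:
            self.add(phrase)

    def add(self, phrase):
        """Insert a phrase, one whitespace-separated token per trie level."""
        node = self.root
        for token in phrase.split():
            node = node.setdefault(token, {})
        node[self._END] = phrase

    def find_all(self, sentence):
        """Return every stored phrase that occurs in the sentence."""
        tokens = sentence.split()
        found = []
        for start in range(len(tokens)):
            node = self.root
            for token in tokens[start:]:
                node = node.get(token)
                if node is None:
                    break  # no stored phrase continues with this token
                if self._END in node:
                    found.append(node[self._END])
        return found

trie = PhraseTrie(["treaty with", "corporate headquarters in"])
print(trie.find_all("Yahoo! corporate headquarters in Sunnyvale"))
```

The benefit of the trie is that the cost of checking a sentence depends on the sentence length and the longest stored phrase, not on how many phrases are stored.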

Extracting everything once
This first M45-based solution let us perform web-scale research that was simply not possible for us before. However, our system still needed to call the M45-based subroutines dozens of times due to its iterative nature. Thus, we decided to use M45 to generate a data set of statistics that could be copied to our local workstations and obviate the need to access M45 at run-time. We generated a data set that specified every noun phrase and every contextual pattern in the corpus, and the number of times each co-occurred with the other. This yielded a data set of about 50GB of uncompressed text, which we can scan through in 10-20 minutes on a single local machine.

To give an example of what the processed data looks like, here are the patterns that "Hadoop" occurs with the most times in the data (with counts):

instance of _   5
top of _        4
_ is a framework        4
users of _      3
_ is an open source implementation      3
clusters running _      3
versions of _   2
version of _    2
tutorial for _  2
system using _  2
system like _   2
processing system using _       2
platform called _       2
look at _       2
_ has been tested on    2
_ has been demonstrated on      2
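
A listing like the one above can be produced by streaming over the data set once. The sketch below assumes a tab-separated layout of noun phrase, pattern, and count per line; the post does not specify the real file format, so that layout, the function name, and the extra sample row are all hypothetical.

```python
import heapq
import io

def top_patterns(lines, noun_phrase, k=5):
    """Stream over 'noun_phrase<TAB>pattern<TAB>count' lines and return the
    k most frequent patterns for one noun phrase as (pattern, count) pairs."""
    matches = []
    for line in lines:
        phrase, pattern, count = line.rstrip("\n").split("\t")
        if phrase == noun_phrase:
            matches.append((pattern, int(count)))
    return heapq.nlargest(k, matches, key=lambda pair: pair[1])

# Rows for "Hadoop" are taken from the listing above; the last row is a
# hypothetical entry for another noun phrase, included to show filtering.
SAMPLE = io.StringIO(
    "Hadoop\tinstance of _\t5\n"
    "Hadoop\ttop of _\t4\n"
    "Hadoop\tusers of _\t3\n"
    "OtherPhrase\tlook at _\t1\n"
)
print(top_patterns(SAMPLE, "Hadoop", k=2))
```

Because the function only iterates over lines, the same code works on an open file handle for a data set far too large to hold in memory.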

Having this local data set lets us run multiple experiments at the same time on different machines, frees us up from having to wait for M45 when it's crowded, and frees up M45 for other users. We're also able to explore new algorithms that can learn to cluster similar noun phrases, try to understand the meaning of individual patterns, etc., without having to use M45 all the time. In addition, we are sharing this data with other students here at Carnegie Mellon by offering a course this semester where the sole purpose is for the students to do something cool with the data as a course project. We're excited to see the results!

The future
We recently parsed our entire set of 500 million sentences using M45. This is a scale seldom attained by academic researchers. Even though we used an off-the-shelf dependency parser that was reasonably fast (for a syntactic parser), the task still required the equivalent of 4,000 M45 node-days. It required breaking up the task into many small pieces and writing scripts to submit jobs based on the number of free nodes on M45. This let us avoid hogging the whole cluster. Whereas our extraction patterns have previously been based only on part-of-speech tags (e.g., noun, verb), now we can use syntactic parse tree fragments, enabling more flexibility and generality in matching.
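
The scripts themselves are not shown in the post, so the following is only a sketch of the throttling logic described above: split the work into small chunks, then submit only as many as the currently free nodes allow while leaving some in reserve. All function names and parameters are hypothetical, and how the number of free nodes is obtained from the cluster is left out.

```python
def split_into_chunks(num_sentences, chunk_size):
    """Break the range [0, num_sentences) into half-open (start, end) pieces."""
    return [
        (start, min(start + chunk_size, num_sentences))
        for start in range(0, num_sentences, chunk_size)
    ]

def plan_submissions(pending_chunks, free_nodes, reserve_nodes=10, nodes_per_job=1):
    """Decide which chunks to submit now.

    Leaves reserve_nodes untouched for other users and returns a pair
    (chunks to submit now, chunks still pending).
    """
    usable = max(0, free_nodes - reserve_nodes)
    slots = usable // nodes_per_job
    return pending_chunks[:slots], pending_chunks[slots:]

chunks = split_into_chunks(1_000, 300)
submit_now, still_pending = plan_submissions(chunks, free_nodes=12)
print(submit_now)
print(still_pending)
```

A driver script would call `plan_submissions` periodically with a fresh free-node count until nothing is left pending.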

We also now have an order of magnitude more data to play with, again thanks to Jamie Callan. We estimate that the English portion of the ClueWeb09 data set will yield 5 billion sentences once we process it.

Thank you, Yahoo!
Without the generous help of Yahoo!, we (and many other academic researchers at the schools fortunate to have access to M45) would be largely left on the sidelines, unable to do web scale research. Thank you, Yahoo!

Andy Carlson
Graduate Student @ Carnegie Mellon, Machine Learning Department

Justin Betteridge
Graduate Student @ Carnegie Mellon, Language Technologies Institute

October 5, 2009

Slides of September 23rd Bay Area Hadoop User Group

Thanks to those of you who attended the monthly Bay Area Hadoop User Group (HUG), held on September 23rd at Yahoo!. After some socializing, BEvERage, and snacks, we had 2 presentations:

Looking forward to seeing you at the next Bay Area HUG, planned for Oct 21st. RSVP is now open at http://www.meetup.com/hadoop/calendar/11532125/.

Dekel Tankel

October 1, 2009

New Update: Yahoo! Distribution of Hadoop

We have published multiple updates to the Yahoo! Distribution of Hadoop since it was announced in June. Each of these releases includes continuing bug fixes and feature backports to stabilize and enhance Hadoop for large scale deployments. Today we published our latest Yahoo! distribution of Hadoop 0.20.1.

You can check out the complete set of changes we've included in today's distribution. Highlights include:

    • improving the robustness of the Capacity Scheduler (such as MAPREDUCE-532)
    • better support for memory intensive tasks (MAPREDUCE-516)
    • securing the task execution by running tasks as the job owner (HADOOP-4490)
    • numerous operability enhancements (such as HADOOP-5643)
    • performance improvements (such as HADOOP-3327)
    • increased metrics (such as HADOOP-5733)
    • many bug fixes

You should note that our GitHub repository has recently changed to track the Apache Hadoop project split that occurred a few months ago, when HDFS, MapReduce, and Common became separate sub-projects. As a result, those of you subscribed to watch the old repository will need to start watching the new repository. The old repository will remain in place, but will no longer be updated.

Nigel Daley
Quality and Release Engineering Manager
Hadoop Team

Do you have what it takes to join Yahoo!'s Hadoop Team?

First, an introduction. I'm Mark Tsimelzon, a recent addition to the Hadoop team. I'm Director of Engineering at Yahoo!, managing MapReduce and a bunch of projects with cute animal names that build database abstractions on top of Apache Hadoop. Having spent most of my career in various startups, I was not sure what I was getting myself into when I joined Yahoo!. To my amazement, what I discovered here was not so different from a startup. The Hadoop team at Yahoo! is filled with extremely smart, hard-working people, who care deeply about their job, Yahoo!, and Open Source. The team moves as fast as any startup does, even though the scale of the problems it solves would make any startup founder deeply envious.

The best part of being a part of the Hadoop team at Yahoo! is that, despite the current global economic situation, this team is growing fast! This is not surprising: all of Yahoo!'s batch data processing is moving to Hadoop, and we need many more great people to join this team. What follows is a quick list of openings we currently have. It includes openings for developers, testers, architects, managers and directors. If you are interested in applying for any of these positions, please send your resume together with a few lines on why you want to work with us on Hadoop to hadoop-jobs-2009@yahoo-inc.com, using the position title as the message subject. Your resume will go straight to one of the hiring managers.

Senior Software Engineer
We are looking for great software engineers who have a wealth of experience with complex software systems, distributed systems, algorithms, data structures, and performance optimizations. Understanding of grid computing, databases, data warehouses, and especially database internals is a big plus. Expert Java skills are required. Experience with agile development and open source development is desired. 6+ years of relevant software development experience are desired.

Software Architect / Team Leads
We are looking for great architects and technical team leads who have a proven track record of designing and delivering complex software systems. Thorough knowledge of distributed systems, algorithms, data structures, performance optimization, scalability, and reliability issues is required. Understanding of grid computing, databases, data warehouses, and especially database internals is a big plus. Solid Java skills are required. Experience with agile development and open source development is desired. 8+ years of relevant experience, including 4+ years in the architect / team lead role, are desired.

Senior Java Performance Engineer
Our Grid Hadoop Performance/Utilization team is looking for a senior performance engineer with expert Java/JVM knowledge to help us: evaluate and propose optimal JVM tuning options for best performance; evaluate various JVMs for best performance and stability as a continuous process; profile the Java code of the Grid software stack to find performance bottlenecks and propose innovative ways to eliminate them; characterize all aspects of HDFS and MapReduce performance; participate in design reviews and propose innovative solutions for performance and scalability improvement throughout the product life cycle; measure system resource utilization and efficiency, and mine and analyze large amounts of logs and traces to identify improvement opportunities; and champion techniques for writing high-performance Java code.

Director of Software Engineering, Hadoop Systems
Our Hadoop development team is looking for a world class software leader with a strong systems management and architecture background to lead our investments in HDFS, ZooKeeper and Hadoop performance. This job involves leading and growing a team of 20 engineers, managers and architects at the heart of the Hadoop open source community. We are looking for someone with a history of driving delivery of complex distributed systems and/or complex systems software such as file systems and operating systems. Our environment is open, collaborative and fast paced. It is filled with very smart and independent people. We require our leaders to nurture such an environment while demanding delivery and high standards in our work product. 8+ years of software development and an additional 8+ years of architecture / management required. Experience with Java, Unix, C++, agile development and open source development desired.

Hadoop Software Quality Engineer Architect
As a QE Architect with 10+ years of experience, you will lead the design and implementation of test plans, test cases, and test frameworks across a number of open source Apache projects related to data processing that underpin the Yahoo! Cloud Computing infrastructure: Pig, Owl, and Zebra. In this highly technical role, you will interface with QA managers, other architects, leads, developers, product managers and operation teams to complete projects. You should possess skills in architecting and leading test efforts for backend components as well as system performance and reliability testing. Excellent Java or C++ coding skills required for this hands-on position.

Senior Whitebox Quality Engineering Lead for Owl
As a Whitebox QE Lead with 5+ years of experience, you will help lead, design, and implement test plans, test cases, and tool-based validation of complex, distributed software. You will interface with QA managers, other leads, developers, product managers and operation teams to complete projects. You should possess skills in leading and testing APIs of backend components as well as system performance testing. Excellent Java or C++ coding skills required for this position.

Senior Whitebox Quality Engineering Lead for ZooKeeper
Think you've got what it takes to engineer testing for ZooKeeper's 5 guarantees (Sequential Consistency, Atomicity, Single System Image, Reliability, Timeliness)? We're looking for a Whitebox QE Lead with at least 5 years of experience to lead, design and implement test strategy, test infrastructure, and test cases. Working with a small development team, you will interface with QA managers, other leads, developers, product managers and operation teams to complete projects. You should possess skills in leading and testing APIs of backend components as well as system performance testing. Excellent Java or C++ coding skills required for this position.

Sr. Manager of Data Processing Quality Engineering
The Hadoop team at Yahoo! is seeking an experienced, hands-on Sr. Whitebox QE Manager with strong technical skills (including coding) to lead and grow a team of quality engineers in delivering world-class data processing services as part of Yahoo! Cloud Computing platform. You will define test strategy, execute test plans, review tests and product specifications, participate in defining and selecting appropriate test tools and automation strategy, and lead a strong team to deliver a high quality product. You will work closely with architects, development, program management, operations, and customers to test and release products.

Once again, please send your resume to hadoop-jobs-2009@yahoo-inc.com, using the position title as the message subject.

Mark Tsimelzon
Director of Engineering
Hadoop Team
