Yahoo! Developer Network Blog



September 30, 2009

Yahoo! at Hadoop World in New York

As the world's largest user of and contributor to Hadoop, Yahoo is excited to be sponsoring and presenting at the upcoming Hadoop World in New York City on Friday, October 2, 2009. Yahoo has been using Hadoop since the beginning of 2006, and we have built up our Hadoop clusters from 20 machines to a current total of more than 24,000 machines.

Eric Baldeschwieler, the VP of Hadoop Development, will describe how we've grown Hadoop into Yahoo's primary batch data analysis platform. Hadoop at Yahoo supports complex data analysis and mining, display advertising, our content platforms, personalization, email spam filtering, and ongoing research into improving our products. Not only has Hadoop reduced development time across a wide range of data analysis projects, it has also increased access to data by removing data silos and enabled projects that would previously have been impossible.

Owen O'Malley, an architect on the Yahoo Hadoop team and the Apache VP of Hadoop, will present our upcoming efforts in two emerging areas of Hadoop development: security and backwards compatibility. Because Hadoop is a central part of Yahoo's data analysis platform, confidential data will be stored on the Hadoop clusters. The current "friendly" security in Hadoop prevents accidents but doesn't slow down anyone trying to work around it. To address this, we are integrating Hadoop with Kerberos and using strong authentication to ensure that all users of Hadoop are who they claim to be and that their access is limited appropriately. On the other front, Hadoop, as with all quickly growing projects, has had incompatible API and protocol changes in each major version. This requires application writers to change and recompile their applications and update the client versions whenever a new version of Hadoop is deployed to the cluster. Starting in the upcoming release of Hadoop 0.21, we're annotating the APIs with the intended audience of the interface (public, limited, private) and the stability of the interface (stable, evolving, unstable), and guaranteeing that applications using only the public, stable interfaces will run without recompilation on new versions of Hadoop. Yahoo also started the Hadoop Avro project, which will let us accommodate different versions of clients connecting to the same server. All of these threads are leading Hadoop toward a 1.0 release.
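
To make the audience/stability labeling concrete, here is a minimal sketch of the idea. Hadoop itself is written in Java, so this Python version only illustrates the policy; the decorator, the attribute names, and the two example functions are hypothetical and are not Hadoop's API.

```python
from enum import Enum


class Audience(Enum):
    PUBLIC = "public"
    LIMITED = "limited"
    PRIVATE = "private"


class Stability(Enum):
    STABLE = "stable"
    EVOLVING = "evolving"
    UNSTABLE = "unstable"


def interface(audience, stability):
    """Hypothetical decorator that labels an interface with its audience and stability."""
    def decorate(obj):
        obj.interface_audience = audience
        obj.interface_stability = stability
        return obj
    return decorate


def is_compatibility_guaranteed(obj):
    """Only interfaces labeled both public and stable carry the guarantee.

    Unlabeled interfaces are treated as carrying no guarantee.
    """
    return (
        getattr(obj, "interface_audience", None) is Audience.PUBLIC
        and getattr(obj, "interface_stability", None) is Stability.STABLE
    )


# Hypothetical example interfaces, for illustration only.
@interface(Audience.PUBLIC, Stability.STABLE)
def submit_job(name):
    return "submitted " + name


@interface(Audience.PRIVATE, Stability.UNSTABLE)
def rebalance_internal():
    return None
```

In this sketch an application that calls only `submit_job` is within the guarantee, while one that reaches into `rebalance_internal` is not.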

Viraj Bhat, an engineer on Yahoo's Hadoop Solutions team, will present his work on Vaidya, a contrib project in Hadoop MapReduce. MapReduce hides many details of parallelization, fault tolerance, data distribution, and load balancing to simplify application development. However, tuning the performance of individual jobs with different data processing and resource utilization characteristics is a significant challenge, even for seasoned parallel programmers. Hadoop Vaidya is an extensible, rule-based performance diagnostic tool for MapReduce jobs. It performs a post-execution analysis of MapReduce jobs by parsing and collecting their execution statistics from job history and job configuration files. It runs these inputs against a set of predefined tests/rules to diagnose various performance problems and provides targeted advice to users through XML reports. At Yahoo, we use Vaidya to analyze thousands of MapReduce jobs running daily on our clusters to detect potential performance improvements.
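
The general shape of a rule-based, post-execution diagnostic can be sketched as follows. This is an illustration of the approach, not Vaidya's code: the rule names, statistics keys, thresholds, advice text, and report layout are all assumptions made for the example, and the statistics are hard-coded rather than parsed from job history files.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    impact: Callable[[Dict[str, float]], float]  # returns a score in [0, 1]
    threshold: float                             # flag the job above this score
    advice: str


@dataclass
class Finding:
    rule: str
    impact: float
    flagged: bool
    advice: str


def spill_impact(stats):
    # Records spilled beyond the map output count, as a fraction of map output.
    out = stats["map_output_records"]
    if out == 0:
        return 0.0
    return max(0.0, min(1.0, (stats["spilled_records"] - out) / out))


def reduce_skew_impact(stats):
    # How far the busiest reducer's share of input exceeds an even split.
    n = stats["num_reduces"]
    total = stats["total_reduce_input_records"]
    if n <= 1 or total == 0:
        return 0.0
    even = 1.0 / n
    share = stats["max_reduce_input_records"] / total
    return max(0.0, (share - even) / (1.0 - even))


# Hypothetical rule set; a real tool would load its rules from configuration.
RULES = [
    Rule("map-side-spill", spill_impact, 0.3, "Consider a larger map-side sort buffer."),
    Rule("reduce-skew", reduce_skew_impact, 0.5, "Consider a better partitioning function."),
]


def diagnose(stats: Dict[str, float], rules: List[Rule]) -> List[Finding]:
    findings = []
    for rule in rules:
        score = rule.impact(stats)
        findings.append(Finding(rule.name, score, score > rule.threshold, rule.advice))
    return findings


def to_xml_report(findings: List[Finding]) -> str:
    root = ET.Element("report")
    for f in findings:
        node = ET.SubElement(root, "finding", rule=f.rule, flagged=str(f.flagged).lower())
        ET.SubElement(node, "impact").text = "%.2f" % f.impact
        if f.flagged:  # advice is only emitted for rules that fired
            ET.SubElement(node, "advice").text = f.advice
    return ET.tostring(root, encoding="unicode")


# Made-up statistics for one finished job.
SAMPLE_STATS = {
    "map_output_records": 1000,
    "spilled_records": 1800,
    "max_reduce_input_records": 300,
    "total_reduce_input_records": 1000,
    "num_reduces": 4,
}
```

Calling `to_xml_report(diagnose(SAMPLE_STATS, RULES))` produces a small XML document in which only the spill rule is flagged, since the sample reducers are close to evenly loaded.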

Jake Hofman, a research scientist at Yahoo, will present his work on developing a large-scale network analysis package. Over the last several years there has been a rapid increase in the number, variety, and size of readily available (social) network data sets. As a result, there is a growing demand for software solutions that enable one to extract relevant information from these data, often leveraging tools from network analysis. For sufficiently large networks (with, e.g., tens or hundreds of millions of nodes), distributed solutions are often necessary, as the storage and memory constraints of single machines are prohibitive. The new package enables such calculations on standard Hadoop clusters. He will give a high-level overview of the package, followed by a discussion of algorithms for calculating node-level features in the map/reduce framework. He will then demonstrate the package on several real-world networks and discuss the use of the calculated network features for predictive modeling tasks.
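
As a small illustration of what computing a node-level feature in the map/reduce style looks like, the sketch below counts in-degree and out-degree from an edge list. It simulates the map, shuffle, and reduce phases in a single Python process rather than using Hadoop's API, and node degree is chosen only because it is the simplest such feature; the post does not say which algorithms the package implements, and all function names here are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Edge = Tuple[str, str]  # (source node, destination node)


def map_edge(edge: Edge) -> Iterator[Tuple[str, Tuple[str, int]]]:
    """Map phase: each edge contributes one out-link and one in-link."""
    src, dst = edge
    yield src, ("out", 1)
    yield dst, ("in", 1)


def shuffle(pairs: Iterable[Tuple[str, Tuple[str, int]]]) -> Dict[str, List[Tuple[str, int]]]:
    """Shuffle phase: group all emitted values by node key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_node(node: str, values: List[Tuple[str, int]]) -> Tuple[str, Dict[str, int]]:
    """Reduce phase: sum the in and out counts for a single node."""
    in_degree = sum(count for kind, count in values if kind == "in")
    out_degree = sum(count for kind, count in values if kind == "out")
    return node, {"in": in_degree, "out": out_degree}


def node_degrees(edges: Iterable[Edge]) -> Dict[str, Dict[str, int]]:
    pairs = (kv for edge in edges for kv in map_edge(edge))
    return dict(reduce_node(node, values) for node, values in shuffle(pairs).items())


# A tiny made-up network.
EDGES = [("a", "b"), ("a", "c"), ("b", "c")]
```

Because each reducer sees only the values for one node, no single machine ever needs the whole graph in memory, which is the property that makes this style attractive for very large networks.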

For those of you who haven't registered yet, here is a discount code. Also, come and join us for a more informal session of lightning talks the previous night. I hope to see you all there!

-- Owen O'Malley

Posted at September 30, 2009 11:11 AM


Hadoop is a trademark of the Apache Software Foundation.

Copyright © 2010 Yahoo! Inc. All rights reserved.
