Distributed Computing Archive: March 2009
March 23, 2009
Hadoop at ApacheCon EU
ApacheCon EU 2009 is in Amsterdam this week. Similar to the Hadoop Camp at ApacheCon US last year in New Orleans, there is a full day of Hadoop talks scheduled. Live video streaming of the Hadoop track is available for a fee from the ApacheCon EU site.
The Hadoop track this year is on Wednesday March 25th and includes:
* Opening Keynote - Data Management in the Cloud - Raghu Ramakrishnan
* Introduction to Hadoop - Owen O'Malley
* Hadoop Map-Reduce: Tuning and Debugging - Arun Murthy
* Pig: Making Hadoop Easy - Olga Natkovich
* Running Hadoop in the Cloud - Tom White
* Configuring Hadoop for Grid Services - Allen Wittenauer
* Dynamic Hadoop Clusters - Steve Loughran
We will end the day with a birds of a feather session with the Yahoo! Hadoop team.
Posted by aanand at 2:24 PM
Apache ZooKeeper: the making of
After working for a long time on many different distributed systems, you learn to appreciate how complicated the littlest things can become. For example, when running an application on a local machine, changing the application's configuration involves clicking on a GUI or, at worst, editing a file and restarting the app. However, distributed applications run on different machines and need to see configuration changes and react to them. To make matters worse, machines may be temporarily down or partitioned from the network. Not only do these outages make things hard to configure, but they also mean application health is no longer a choice between dead or alive; you also have mostly alive, mostly dead, and the dreaded half dead. Robust distributed applications also have the ability to incorporate new machines or decommission machines on the fly. Partial failures together with elastic machine ensembles mean that even the configuration of the distributed application should be dynamic. On top of all this, theoretical results such as the FLP proof (consensus is impossible in asynchronous systems with even one failure) and the CAP theorem (strong Consistency, high Availability, and Partition-tolerance: pick two, you can't get all three) mean that some compromises must be made.
As we were working with the different systems here at Yahoo, we had some very general requirements of distributed applications in this context. First, applications need "ground truth"; they need an oracle, or service, that they can just believe without second guessing. Second, the service needs to be as simple as possible, both to decrease the likelihood of bugs and make it as easy to understand as possible. Third, the service needs to have good performance so that applications can use the service extensively. If a developer goes to the trouble of integrating the service into their application, they should be able to make full use of it.
We designed ZooKeeper to meet these requirements. (We already had a few distributed systems projects with animal names, and the term zoo conjures up the sense of chaos that tends to prevail in large systems.) Our background in distributed file systems motivated a hierarchical namespace and a file-system-like model. That same background gave us insight into some of the features of file systems that are particularly hard to implement. (Rename is the worst!) We also thought that such a model would be familiar to new developers, since the file API is one of the earliest learned. We added a couple of things: watches, which are callbacks that trigger on specific changes to files, and ephemeral files, which disappear if the client that created them disconnects (on purpose or due to failure) from ZooKeeper.
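To make watches and ephemeral files concrete, here is a minimal in-memory sketch in Python. It is a toy model, not the ZooKeeper API: the class ToyNamespace, its method names, the session ids, and the choice to make watches fire only once are all assumptions made for illustration.

```python
class ToyNamespace:
    """Toy hierarchical namespace with watches and ephemeral files.

    Hypothetical stand-in for illustration; not the ZooKeeper client API.
    """

    def __init__(self):
        self.nodes = {"/": b""}   # path -> data
        self.owners = {}          # ephemeral path -> owning session id
        self.watches = {}         # path -> list of callbacks

    def _fire(self, path, event):
        # Assumption of this sketch: a watch fires once and is then removed.
        for callback in self.watches.pop(path, []):
            callback(event, path)

    def create(self, path, data=b"", session=None):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError("no parent: " + parent)
        if path in self.nodes:
            raise KeyError("already exists: " + path)
        self.nodes[path] = data
        if session is not None:
            # An ephemeral file is tied to the session that created it.
            self.owners[path] = session

    def get(self, path, watch=None):
        data = self.nodes[path]   # raises KeyError if the file is missing
        if watch is not None:
            self.watches.setdefault(path, []).append(watch)
        return data

    def set(self, path, data):
        if path not in self.nodes:
            raise KeyError("no such file: " + path)
        self.nodes[path] = data
        self._fire(path, "changed")

    def delete(self, path):
        del self.nodes[path]
        self.owners.pop(path, None)
        self._fire(path, "deleted")

    def close_session(self, session):
        # Disconnecting (on purpose or by failure) removes ephemeral files.
        for path in [p for p, s in self.owners.items() if s == session]:
            self.delete(path)


if __name__ == "__main__":
    ns = ToyNamespace()
    ns.create("/config", b"v1")
    ns.get("/config", watch=lambda event, path: print(event, path))
    ns.set("/config", b"v2")          # the watch on /config is called
    ns.create("/workers")
    ns.create("/workers/w1", session="s1")
    ns.close_session("s1")            # /workers/w1 disappears
    print(sorted(ns.nodes))
```

In this sketch a client that cares about a worker's liveness would read the worker's ephemeral file with a watch and be called back when the file disappears.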
For reliability, we needed the service to be provided by a cluster of servers. We have lots of clients and need high performance, so we allow a client to connect to any active server in the cluster. Since our initial target applications were very read dominant, we wanted read requests to be serviced by replicas without having to coordinate with other replicas. We also didn't want to use locks for updates, both because of their detrimental impact on performance and because of the complications they add to the implementation. So, we order all update requests and guarantee that all replicas see the same order of updates. In the end all update requests are totally ordered and all reads are ordered with respect to update requests. Of course, there are plenty of other details, but these were our key choices.
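The ordering guarantee can be illustrated with a small Python sketch. It assumes a single in-process Sequencer that assigns the global order; that is a simplification chosen for the example, not a description of how ZooKeeper servers agree on the order. Sequencer, Replica, and catch_up are hypothetical names.

```python
class Sequencer:
    """Puts every update into one global order (hypothetical stand-in)."""

    def __init__(self):
        self.log = []   # list of (key, value); list position is the order

    def submit(self, key, value):
        self.log.append((key, value))
        return len(self.log)   # 1-based sequence number of this update


class Replica:
    """Applies updates strictly in log order and serves reads locally."""

    def __init__(self, sequencer):
        self.sequencer = sequencer
        self.state = {}
        self.applied = 0   # how many log entries this replica has applied

    def catch_up(self, upto=None):
        log = self.sequencer.log
        end = len(log) if upto is None else min(upto, len(log))
        while self.applied < end:
            key, value = log[self.applied]
            self.state[key] = value
            self.applied += 1

    def read(self, key):
        # No coordination with other replicas: the answer may lag behind,
        # but it always reflects a prefix of the one global update order.
        return self.state.get(key)


if __name__ == "__main__":
    seq = Sequencer()
    fast, slow = Replica(seq), Replica(seq)
    seq.submit("leader", "host-a")
    seq.submit("leader", "host-b")
    fast.catch_up()
    slow.catch_up(upto=1)             # this replica is one update behind
    print(fast.read("leader"), slow.read("leader"))
    slow.catch_up()
    print(fast.state == slow.state)
```

A lagging replica can return an older value, but no replica ever observes the updates in a different order, which is the property the rest of the design relies on.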
Our decision to focus on ordering turned out to be key to moving from a dynamic configuration service to a full-fledged coordination service. We had shown our production partners that things like configuration, leader election, group membership, and server status could all be done easily with ZooKeeper, but after hearing about a service from Google called Chubby they became convinced that they needed locks. We looked at adding a lock method to our service, but we could not get it to fit nicely into our design. Our implementation was completely wait-free, and adding a blocking method would require a complete redesign. Soon we realized that by taking advantage of ordering and watches we could implement efficient locks at the client. This led us to start documenting some higher level primitives that can be implemented by clients without modifying the ZooKeeper server at all. We call them recipes. Once we got started, we realized we could do all sorts of coordination primitives like read-write locks, preemptable locks, queues, barriers, etc. As an interesting postscript, the group that initially requested the locks ended up not needing them after all.
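As a rough illustration of how ordering plus watches can give a client-side lock, the Python sketch below follows the general shape of the lock recipe in the ZooKeeper documentation: each client creates a sequence-numbered ephemeral entry, the lowest number holds the lock, and every other client watches only the entry just ahead of it. The post itself does not spell out these steps, and ToyStore and LockClient are hypothetical in-memory stand-ins, not ZooKeeper classes.

```python
class ToyStore:
    """In-memory stand-in for the server side (hypothetical)."""

    def __init__(self):
        self.next_seq = 0
        self.nodes = {}     # sequence number -> owning session
        self.watches = {}   # sequence number -> callbacks run on delete

    def create_sequential(self, session):
        seq = self.next_seq          # the store hands out the ordering
        self.next_seq += 1
        self.nodes[seq] = session
        return seq

    def watch_delete(self, seq, callback):
        self.watches.setdefault(seq, []).append(callback)

    def delete(self, seq):
        del self.nodes[seq]
        for callback in self.watches.pop(seq, []):
            callback()

    def close_session(self, session):
        # Ephemeral entries vanish when their session ends.
        for seq in [s for s, owner in self.nodes.items() if owner == session]:
            self.delete(seq)


class LockClient:
    """All lock logic lives in the client; the store is never modified."""

    def __init__(self, store, name):
        self.store, self.name = store, name
        self.seq = None
        self.holds_lock = False

    def acquire(self):
        self.seq = self.store.create_sequential(self.name)
        self._check()

    def _check(self):
        if self.seq not in self.store.nodes:
            return                   # our own entry is gone; stop waiting
        ahead = [s for s in self.store.nodes if s < self.seq]
        if not ahead:
            self.holds_lock = True   # lowest sequence number wins
        else:
            # Watch only the entry directly ahead, so a release wakes one
            # waiter instead of all of them.
            self.store.watch_delete(max(ahead), self._check)

    def release(self):
        self.holds_lock = False
        self.store.delete(self.seq)


if __name__ == "__main__":
    store = ToyStore()
    a, b, c = (LockClient(store, n) for n in "abc")
    for client in (a, b, c):
        client.acquire()
    store.close_session("b")         # b fails while waiting
    a.release()
    print(a.holds_lock, c.holds_lock)
```

Nothing in the sketch blocks: a waiting client simply returns and is called back later, which is how a lock can sit on top of a wait-free service.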
ZooKeeper's adoption into production was faster than we expected or could handle. The first group we pitched the idea to immediately adopted it for their next project. Eventually we stopped pitching ZooKeeper, because we needed to focus on implementation and we were getting too much interest. We had written a prototype implementation, but the code base that is now ZooKeeper was implemented from scratch in a period of three months by a single developer. Our initial users gave a lot of feedback and helped us get through a lot of the early bugs (thanks Mark and Zeke!). Of course, many more bugs have been fixed since then, and our developer base has grown to four. By the end of the first year of ZooKeeper's development we open sourced it using SourceForge. A few months after that we became an Apache subproject under Hadoop.
There are still plenty of things to do: partitioned namespace, more performance enhancements, higher level client primitives, etc. So do you have a distributed system? (Yes, an application that runs on two machines counts!) Want to make the world a better place? Want to save humanity from utter chaos? Join the discussion! Contributions welcome!
Benjamin Reed
Yahoo! Research
Apache ZooKeeper Committer
Posted by aanand at 9:35 AM
March 18, 2009
Using Hadoop to fight spam - Part II
Hadoop helps Yahoo! Mail reduce spam for over 300 million people. In the second part of their talk, Mark Risher and Jay Pujara, Yahoo! Mail technologists and spam cops, describe how they are using Hadoop to fight botnets.
Mark and Jay explain how Hadoop makes it possible for Yahoo! Mail to quickly analyze huge sets of data to identify where spam comes from. They describe how team members can quickly come up to speed and submit queries on the data using Pig, and how they see this effort evolving. Listen in as they describe their experiences:
Posted by aanand at 10:15 AM
March 4, 2009
Using Hadoop to fight spam - Part 1
We interviewed Mark Risher and Jay Pujara, leaders in the war against spam for Yahoo! Mail. With over 300 million users and billions of messages, looking for problems or patterns to identify spammers can be a daunting task. Mark and Jay describe how their previous approach using databases quickly ran into scalability limitations as they analyzed data aggregated over a month or more. They explain how Hadoop, with Pig and Streaming, now enables them to slice through billions of messages to isolate patterns and identify spammers. They can now create new queries and get results within minutes, for problems that took hours or were considered impossible with their previous approach. Listen in as Mark and Jay describe their experiences fighting spam:
Posted by aanand at 1:24 PM | Comments (1)
March 1, 2009
Hadoop User Group
Hadoop User Group meetings have now been held in Beijing, Berlin, London, New York, San Diego and Washington DC, in addition to the Bay Area, with one in the works in Bangalore. In the Bay Area, we typically host them on the third Wednesday of each month at the Yahoo! campus in Santa Clara.
The meeting last week featured Matei Zaharia from UC Berkeley talking about the Fair Scheduler for Hadoop. The need for a scheduler has been a known requirement for quite a while, and Matei got started working on this while he was an intern at Facebook. His talk described their goals of providing fast response time for small jobs and guaranteed SLAs for production jobs. It then discussed the concept of pools, the scheduling algorithm for assigning resource capacity, as well as installation, configuration and administration of the scheduler.
This was followed by a talk from Aaron Kimball of Cloudera on Importing Data from MySQL, which discussed techniques for loading data from databases into HDFS.
Next month’s user group meeting will feature Yahoo!’s Milind Bhandarkar talking about performance enhancement techniques for Hadoop developers.
Posted by aanand at 8:14 PM | Comments (1)

