Distributed Computing Archive: September 2008

« Previous | Main | Next »


September 30, 2008

Scaling Hadoop to 4000 nodes at Yahoo!

We recently ran Hadoop on what we believe is the single largest Hadoop installation, ever:

• 4000 nodes
• 2 quad core Xeons @ 2.5ghz per node
• 4x1TB SATA disks per node
• 8G RAM per node
• 1 gigabit ethernet on each node
• 40 nodes per rack
• 4 gigabit ethernet uplinks from each rack to the core (unfortunately a misconfiguration, we usually do 8 uplinks)
• Red Hat Enterprise Linux AS release 4 (Nahant Update 5)
• Sun Java JDK 1.6.0_05-b13
• So that's well over 30,000 cores with nearly 16PB of raw disk!

The exercise was primarily an effort to see how Hadoop works at this scale and gauge areas for improvements as we continue to push the envelope. We ran Hadoop trunk (post Hadoop 0.18.0) for these experiments.

Scaling has been a constant theme for Hadoop: we, at Yahoo!, ran a modestly sized Hadoop cluster of 20 nodes in early 2006; currently Yahoo! has several clusters around the 2000 node mark.


HDFS

The scaling issues have always been the main focus in designing any HDFS feature. Despite these efforts, attempts to scale the cluster up in the past sometimes resulted in some unpredictable effects. One of the most memorable examples was the cascading crash described in HADOOP-572, when failure of just a handful of data-nodes made the whole cluster completely dysfunctional in a matter of minutes.

This time the testing went smoothly and we observed quite decent file system performance. We did not see any startup problems; the name-node did not drown in self-serving heartbeats and block reports. Note, that heartbeat and block intervals were configured with the default values of 3 seconds and 1 hour respectively.

We ran a series of standard DFSIO benchmarks on the experimental cluster. The main purpose of this was to test how HDFS handles load of 14,000 clients performing writes or reads simultaneously.

HDFS Cluster Statistics

Capacity : 14.25 PB
DFS Remaining : 10.61 PB
DFS Used : 233.44 TB
DFS Used% : 1.6 %
Live Nodes : 4049
Dead Nodes : 226

Map-Reduce Cluster Statistics

Nodes: 3561
Map Slots: 4 slots per node
Reduce Slots: 4 slots per node

DFSIO benchmark is a map-reduce job where each map task opens a file and writes to or reads from it, closes it, and measures the i/o time. There is only one reduce task, which aggregates and averages individual times and sizes. The result is the average throughput of a single i/o that is how many bytes per second was written or read on average by a single client.

In the test performed each of the 14,000 map tasks writes (reads) 360 MB (about 3 blocks) of data into a single file with a total of 5.04 TB for the whole job.

The table below compares the 4,000-node cluster performance with one of our 500-node clusters.

Table 1. Throughput
  500-node cluster 4000-node cluster
  write read write read
number of files 990 990 14,000 14,000
file size (MB) 320 320 360 360
total MB processes 316,800 316,800 5,040,000 5,040,000
tasks per node 2 2 4 4
avg. throughput (MB/s) 5.8 18 40 66

The 4000-node cluster throughput was 7 times better than 500’s for writes and 3.6 times better for reads even though the bigger cluster carried more (4 v/s 2 tasks) per node load than the smaller one.


Map-Reduce

The primary area of concern was the JobTracker and how it would react to this scale (we had never subjected the JobTracker to heartbeats flowing in from 4000 tasktrackers since it isn't a very common use-case when we use HoD). We were also concerned about the JobTracker's memory usage as it serviced thousands of user-jobs.

The initial results were slightly worrisome - GridMix, the standard benchmark, took nearly 2 hours to complete and we lost a fairly large number of tasktrackers since the JobTracker couldn't handle them. For good measure, we couldn't run a 6TB sort either; we kept losing tasktrackers. (We routinely run sort benchmarks which sort 1TB, 5TB and 9TB of data.)

Of course, brand-new hardware didn't help since we kept losing disks, neither did the fact that we had a misconfigured network which let us use only 4 out of the 8 uplinks available from each rack to the backbone (effectively cutting the available bandwidth in half). On the bright side memory usage didn't seem to be a problem and the JobTracker stood up to thousands of user-jobs without problems.

We then went in armed with the YourKit(TM) profiler - we needed to peek into the JobTracker's guts while it was faltering. This basically meant going through the CPU/Memory/Monitors profiles of the JobTracker with a fine-toothed comb. To cut a long story short, here are some of the curative actions we took based those observations:
HADOOP-3863 - Fixed a bug which caused extreme contention for a single, global lock during serialization of Java strings for Hadoop RPCs.
HADOOP-3848 - Cut down wasteful RPCs during task-initialization.
HADOOP-3864 - Fixed locks in the JobTracker during job-initialization to prevent starvation of tasktrackers' heartbeats which caused the huge number of 'lost tasktrackers'.
HADOOP-3875 - Fixed tasktrackers to gracefully scale heartbeat intervals when the JobTracker is under duress.
HADOOP-3136 - Assign multiple tasks to the tasktrackers during each heartbeat, this significantly cuts down the number of heartbeats in the system.

The result of these improvements (sans HADOOP-3136 which wasn't ready in time):
1. GridMix came through slightly under an hour - a significant improvement from where we started.
2. The sort of 6TB of data completed in 37 minutes.
3. We had the cluster run more than 5000 Map-Reduce jobs in a window of around 6 hours and the JobTracker came through without any issues.

Overall, the results are very reassuring with respect to the ability of Hadoop to scale out. Of course we have only scratched the surface and have miles to go!


Konstantin V Shvachko
Arun C Murthy
Yahoo!

Posted by aanand at 2:04 PM | Comments (5) | TrackBack | Permalink

September 25, 2008

Hadoop 0.18 Highlights

Apache Hadoop 0.18 was released on 8/22. This is the largest Hadoop release to date in terms of the number of patches committed (266). It also has the largest percentage of patches (20%) from contributors outside of Yahoo!. This is a great indicator of both the growth of the Hadoop community and their increasing involvement in the projects progress. The size of the release resulted in a very large number of blocking bugs in the code base. Unfortunately, this created a big delay between the feature freeze on 6/4 and the final release.

Hadoop 0.18 has many improvements in the areas of performance, scalability and reliability in addition to new features. Some of the performance improvements contributed to Hadoop’s first place in the terabyte sort benchmark. Hadoop 0.18 runs the grid mix benchmark in ~45% of the time taken by Hadoop 0.15. Lots of cool new stuff in this release, some of which is briefly described below.

HDFS


Namespace auto-recovery
The HDFS Namenode can store the filesystem image and journal in multiple locations. Upon startup it automatically consults all configured locations of its state and reads the most up to date image and journal. If all of the Namenodes copies of data are unavailable state can be (mostly) recovered from the secondary Namenode using the ‘¬-importCheckpoint’ switch. More details can be found in HADOOP-2585.
Fast restart
Namenode re-start, particularly for large clusters has been slow. It has until Hadoop 0.17 taken, for instance, over an hour to bring up a Namenode on a 2000 node cluster. Problems included inefficiencies in block report processing, getting stuck in safe mode etc. Most of these have been addressed and Namenode startup on up to 3000 node clusters happens in <15 minutes. Discussion can be found in HADOOP-3022.
Namespace quotas and archives
HDFS now has directory-based quotas for namespace management. A quota set on a directory limits the number of entries in that sub-tree to the quota value. Only the super-user may set or change quotas. Quotas can be manipulated programmatically or via command line utilities. HADOOP-3187 describes quotas in detail.
RPC performance and scaling improvements
This release comes with a significant re-vamp of the RPC subsystem in the form of HADOOP-2188, HADOOP-2909 and HADOOP-2910. This includes the use of pings instead of timeouts, improvements in the management of idle connections and client throttling when the server is under load. These improvements will have the greatest effect of large clusters (>1000 nodes) and prevent jobs from failing when the Namenode or Jobtracker are under load.
Read/write performance improvements
HADOOP-1702 reduces buffer copies while writing to HDFS and brings down Datanode CPU usage during writes by 30%. HADOOP-3164 used sendfile on the Datanodes for reads from HDFS. This results in an 80% reduction in CPU usage by the Datanodes.
Audit logging
HADOOP-3336 introduces audit logging for HDFS. The Namenode logs all file and directory accesses. An audit log entry includes the originating IP, action requested, pathname accessed, client user and group id and existing permissions on the accessed pathname.
Append … almost there
Lots more work to support append which unfortunately did not make the cut for Hadoop 0.18. Notable changes include lease recovery for append (HADOOP-3310), datanode generation stamp upgrades (HADOOP-3283) and lease management when open files get renamed (HADOOP-3176).
Mounting via FUSe
The oldest open Hadoop bug, HADOOP-4 is now closed. This work enables HDFS mounting via FUSE.

Map/Reduce


Intermediate compression that just works
Compression of intermediate outputs in Hadoop Map/Reduce has long been a source of grief. Intermediate compression, when enabled would frequently induce job failure by causing tasktrackers to run out of memory, run slow, cause disk thrash etc. HADOOP-3366 and HADOOP-2095 address these problems by introducing a different file format for shuffle data and improving memory management in reduce tasks. Intermediate compression may now be enabled with the supported codecs (gzip and lzo).
(Single) reduce optimizations
Many important optimizations in sort/merge in reduce tasks. HADOOP-3297 improves the fetching of many small outputs. HADOOP-3365 eliminates unnecessary buffer copies during the merge phase. HADOOP-3429 improves Hadoop streaming performance by buffering the i/o paths to streaming processes.
Archive tool
Quotas are complemented by ‘Hadoop archives’, which are a tool for users to manage their namespace consumption. A large number of files can be converted into a Hadoop archive using a Map/Reduce utility. A Hadoop archive is basically an HDFS directory with a small number of data files that consist of files from the original set concatenated together. An index stores the location of each file from the original set. Individual files in an archive can be accessed using a special URI with the ‘har’ schema. Archives and their use are discussed in HADOOP-3188.

Sameer Paranjpye
Yahoo!

Posted by aanand at 7:45 AM | Comments (0) | TrackBack | Permalink

Copyright © 2010 Yahoo! Inc. All rights reserved. Copyright | Privacy Policy

Help us continue to improve the Yahoo! Developer Network: Send Your Suggestions