Yahoo! Developer Network Blog
« Previous | Main | Next »
April 28, 2008
Hadoop 0.17 Preview
Apache Hadoop 0.17 is due for release any day now. Feature freeze for the release was on April 4th. The Hadoop dev community is currently actively fixing blocking issues discovered by users that have tried it out. This is a release we’re very excited about as it introduces many long awaited performance fixes to the platform. We’ve observed on the order of 30%(!) improvement in the runtime of some of the Hadoop benchmarks. As always, user feedback is invaluable and we urge folks to kick the tires on the release and help close it out. Here is a quick rundown of the important changes in the release.
HDFS
- Syntax cleanup of Hadoop fs shell commands
The syntax of many Hadoop fs shell commands has been revised. The goals have syntax consistency across commands and compatibility with POSIX syntax as far as possible. For examples, see HADOOP-1677, HADOOP-1792, HADOOP-1891 - Block placement changes
HDFS’ block placement strategy is now more tuned towards evenly distributing data across nodes. When a block is written from a node that is not running a Datanode, the first replica is written to a node on the same switch as the writer (instead of being written to the same node as the writer). The remaining two replicas are written to two nodes on a different switch. This strategy produces substantially better distributions in cases where data is loaded into HDFS from a small number of machines. More details at HADOOP-2559. - Append... getting closer
Everyone’s favorite bug, HADOOP-1700, is still not closed, but a lot of progress has been made in this release. Append related work will make possible new flush() and sync() methods in the HDFS interface. The semantics are familiar. The flush() call plonks data into the ‘system buffers’ and returns immediately. The sync() call writes the data to the system, and returns only when it has hit disk. - More efficient replication
Losing a rack of nodes usually means that the Namenode has to replicate around half a million blocks, and quickly. This would cause Namenode responsiveness to degrade and would increase the risk of data loss. A faster replication scheduling algorithm in the Namenode enables it to maintain quality of service to clients and replicate data faster in the event of losing many nodes at once.
Map/Reduce
- Switch awareness
Hadoop map/reduce is now capable of switch aware task placement. The framework attempts to place tasks on machines where their input data resides. In many cases (in particular with HoD), machines with input data are not available for running tasks, but machines that share a switch with such ‘input nodes’ are. Hadoop will now attempt to place tasks on machines that are switch local to input when machines with input data are unavailable. - Faster task scheduling
HADOOP-2119 removes many inefficiencies in task placement and scheduling logic. The JobTracker would perform linear scans of the list of submitted tasks in cases where it did not find an obvious candidate task for a node. With better data structures for managing job state, all task placement operations now run in constant time. - Sort and shuffle improvements
A couple of significant improvements to sort and shuffle are included in the form of HADOOP-910 and HADOOP-2919. HADOOP-910 has reducers performing merges of shuffle data (both in memory and on disk) while fetching map outputs. HADOOP-2919 improves memory management in sort on the map side, substantially decreases setup cost for the sort and uses quicksort instead of mergesort as the sorting algorithm.
Sameer Paranjpye
Yahoo! Grid Computing Team
Posted at April 28, 2008 2:13 PM
Comments
Beautiful!
Is 0.17 backwards-compatible - a drop-in replacement for 0.16.*?
Thanks.
Posted by: Otis Gospodnetic at April 29, 2008 11:30 AM | Permalink
Post a comment
Comment Policy: We encourage comments and look forward to hearing from you. Please note that Yahoo! may, in our sole discretion, remove comments if they are off topic, inappropriate, or otherwise violate our Terms of Service.
Hadoop is a trademark of the Apache Software Foundation.
Subscribe
Recent Blog Articles
view all
Hadoop Bay Area User Group - Feb 17th at Yahoo!, Sunnyvale
Wed, 03 Feb 2010
Comparing Pig Latin and SQL for Constructing Data Processing Pipelines
Fri, 29 Jan 2010
Video from Jan. 20, 2010 Hadoop Bay Area User Group now online
Thu, 28 Jan 2010
Stomping out Java "concurrency cockroaches" with SureLogic's Flashlight and JSure tools
Tue, 26 Jan 2010
Hadoop Bay Area January 2010 User Group - Recap
Thu, 21 Jan 2010
Recent Links
Appcelerator Titanium + Yahoo YQL on Vimeo
Mon, 08 Feb 2010
Tue, 02 Feb 2010
PhoneGap | Cross platform mobile framework
Sat, 30 Jan 2010
Web developers can rule the iPad - O'Reilly Radar
Sat, 30 Jan 2010
rc3.org - Is the iPad the harbinger of doom for personal computing?
Thu, 28 Jan 2010
Archives
Recent Readers

