Yahoo! Developer Network Blog
« Previous | Main | Next »
June 19, 2009
Notes from the NoSQL Meetup
General Comments
I attended the first NoSQL meetup in SF last Thursday and took some notes. This meeting was a presentation and discussion of distributed, non-relational database systems (DNRDBMS). Examples include Google's Bigtable, Amazon's Dynamo, and Yahoo!'s UDB and Sherpa. A couple of the systems presented were built on Hadoop's HDFS.
This was an exciting meeting -- the vibe was great, everyone was supportive and interested. Lots of emphasis on architecture, features and technology, lots of smart people. I learned a ton. This writeup is basically a narrative of my impressions during the presentations. A more structured comparison of the projects would be interesting too. The focus was mainly, but not totally on open source implementations. I was very impressed by all of the people and projects presented.
Please note that this written from my point of view as the product manager for Sherpa, Yahoo!'s internal cloud key-value store. Despite any real or perceived snark, I was very impressed by all of the people and projects presented, and by the excitement the event generated. Kudos to Johan Oskarsson from Last.fm who put it together (from London, no less), CBS Interactive for hosting, and Digg for buying the beer.
Here at Yahoo!, we're interested in helping out with any follow-up meetups. This is a really interesting area of research and experimentation right now, as the variety of implementations attest.
I'd be familiar with the Bigtable and Dynamo papers before reading this summary -- it will help explain what people are trying to do in this space.
There were 3 fairly distinct types of non-relational databases shown.
Distributed key-value stores (Dynamo-like systems):
Distributed column stores (Bigtable-like systems):
Something a little different:
*CouchDB
Yahoo! has a lot of experience with key-value stores. Our venerable user database (UDB) runs on thousands of machines, in dozens of datacenters, with dynamic, record-level replication. As far as I know, we've never shared any architectural or performance information about the UDB externally. It will be nice to be more open about Sherpa, our next-generation key-value store.
Many of these systems are application specific; the trade-offs involved in building distributed systems (the CAP, latency vs. durability, etc.) seemed driven by the needs of the application, rather than by awareness and consideration of what the trade-offs meant. For example, preseneters discussed write latencies so low that there was clearly no disk access involved. No one felt the need to acknowledge that there would be data loss if the machine crashed before a sync.
I didn't hear a mention of the "cloud computing" buzzword till 3:30 in the afternoon. Very few of these apps presented are actually ready for a cloud (massive scale, multi-colo) deployment, but most of the users would like them to be.
I was a little disappointed in the testing rigor of the projects. Performance testing is a underappreciated art. At Yahoo!, we've spent a lot of time testing to precisely characterize our system performance. There are a number of factors that have a big impact on performance, including record size, consistency model, cache hit ratio, read/write ratio, etc. and it would be good to see these reflected in system characterization.
Here are my notes from the specific speakers:
Introductions
Todd Lipcon, ClouderaReally excellent summary of many of the issues we are trying to solve with Sherpa and these systems in general. Here is his deck, embedded from Slideshare.net:
*Partitioning schemes (does maintaining a map really not scale?)
*Data models
*Consistency models
*Conflict resolution
*Storage layout
*Cluster state management
*API (get, set, delete, multi-get)
*Automated recovery
*Performance
I thought of a few other issues we've had to deal with:
*Cross datacenter replication
*Network architecture (guess no one here has melted a switch)
*Operational scale
*Cluster expansion
It would also be interesting to explore the trade-offs inherent in some of the design decisions. Maybe Yahoo! could contribute here, since we deal with these trade-offs every day.
Voldemort
Jay Kreps, LinkedInVoldemort is a distributed key-value store written in Java. Jay is a interesting laid-back guy with some strong opinions. This was one of the best things about the presentations: people said what they thought and the audience seemed cto listen and understand, rather than attack (until later, when we got to the java/c++ stuff). These opinions were definitely a memorable part of his presentation.
Notes:
*Voldemort is a partitioned key-value store, more like Sherpa than Dynamo.
*Conventional RDBMS don't scale for services.
*Make model fit implementation (Make it easy for developers to understand the impact of specific methods.)
*Sherpa scan is a counter-example -- it looks easy but crushes the system.
*90% of caches fix problems that shouldn't exist; all caching should be in the storage layer
*Voldemort has a flexible deployment architecture that makes it possible to reduce the number of hops an operation takes (e.g., put the router in the client).
*All serialization methods suck--in 5 years we will forget about this problem. Voldemort supports a whole bunch of them
*Partitioning should handle nodes with different performance characteristics.
*Storage is pluggable.
*HTTP client doesn't perform; they use a custom socket protocol. We've had the same problems with almost all HTTP clients.
*They are planning for Hadoop integration; haven't done it yet.
*Vector clocks for conflict resolution
Cassandra
Avinash Lakshman, FacebookDefinitely the belle of the ball -- the presenter was one of the designers of Amazon's Dynamo, and brought a lot of those ideas to Cassandra, a Bigtable clone used for Facebook's mail search. Avinash had a very polished presentation with a bit more rigor than many of the other presenters.
Rackable Rackspace seems to be planning a cloud deployment of Cassandra in the near future.
Notes:
*Highly available; eventual consistency only
*Replication knobs available (like Dynamo)
*For performance, they rely on application specific behavior -- when a user positions their mouse in their inbox search box, their index gets pulled into memory before the search term is entered.
*Does not like pluggable storage models; says that app-specific optimizations are too valuable to generalize away. Specific example was a zero-copy streaming network copy enabled by a Linux call -- data can be streamed from/to disks over the network. Hmmm...
*180 node/50 TB system
*Messaging includes failure simulations (like dropping random messages)
*Future directions: ACLs (very familiar for Yahoos), transactions, compression. communicative ops, pluggable (app-specific) inconsistency reconciliation
*Simple designs are better designs.
*Test by tee-ing network traffic
Dynomite
Cliff Moon, Microsoft (Powerset)A Dynamo clone written in Erlang. Dynomite and CouchDB are both written in Erlang, which seems to shorten development time and the expense of performance. Cliff talked about the need to rewrite the critical bits in C. He showed some code samples. Erlang is not immediately intuitive for rusty C/Perl programmers.
Dynomite seemed to be fairly immature technology -- while it's being used in production for image-scraping Wikipedia, the cluster size was small (12 systems) and the system only ran in batch mode. System doesn't support delete (hard in distributed environments).
Cliff's best point was about the need for consistent APIs between layers. Clearly there is a proliferation of key-values stores and we should be thinking about some standardization at some level.
Notes:
*Performance at the low end of the scale: 99% reads (100 byte records) at 1 sec.
*Very configurable--lots of knobs to turn for performance/scale. Very Dynamo-esque in this way.
*Lots of serialization protocols.
HBase
Ryan Rawson, StumbleUponHbase is a Bigtable clone written on top on of HDFS. It's used by StumbleUpon and Powerset among others. Hbase is Java; (Hypertable is essentially the same thing written
in C++). Hbase is a cool project and they've made impressive performance improvements on previous versions. Stumble is actually serving data off Hbase, which I always thought was prohibitively slow. Ryan began by explaining how Stumble chose Hbase -- they considered Cassandra, Hypertable, and Hbase. Cassandra didn't work, Hypertable was fast and Hbase was slow. They chose Hbase because of the great community.
Notes:
*The Bigtable bells and whistles: optimized for scans, not random access; rudimentary indexes; fast sorts.
*Uses Zookeeper (w00t!)
*Says that retrieving 500 rows takes 30 ms -- however the total data size is about 1k (for all the rows). Not clear if this average/best/99.9%, but still an major increase over the same op in previous versions.
*Well supported operationally -- monitoring, rolling upgrades, etc.
*No cross-datacenter replication -- next version.
Hypertable
Doug Judd, ZeventsHypertable is big table, built on HDFS. It's basically the same as Hbase, but written in C++. This is because, according to Doug, Java is not suitable for high-performance, memory intensive applications. Zevents and Baidu use Hypertable extensively; Zevents for offline crawling and processing.
Hypertable is cool, but it is very similar in features to other Bigtable projects, and I was starting to get tired. If you have a deep love for C++, Hypertable may be your thing. But then, HDFS is written in Java. At least it's not Erlang. ;)
Notes:
*Has request throttling to protect performance
*Group commits
*Failure inducer class
CouchDB
J. Chris Anderson, Couch.ioIf Cassandra was the belle of the ball, CouchDB was definitely the jester. While I'm not totally sure I agree with what they are trying to do, it's totally cool seeing someone blast away at the assumptions behind datastores in general and try to build something expressly for the web. Also, I totally recommend seeing J. Chris speak -- he's got some very interesting ideas and is both clear and entertaining.
CouchDB approaches many of the problems that we are dealing with by avoiding them. For example, the focus is totally on availability; if two records are inconsistent, it just returns both with the query and lets the app deal with it. Data is stored in append-only files -- I didn't totally catch this, but I believe the entire structure is
rewritten to disk if there are any changes with a checkpoint that allows quick recovery. The logs are periodically compacted for space. J. Chris said CouchDB should perform at about 80% of the disk performance. While time will tell about their design decisions, simpler is definitely better.
Data is stored in schema free JSON. You can have CouchDB emit HTML or RSS/atom -- this reduces deployment complexity--no web serving tier. I think we should explore this in Sherpa -- it's a really good idea, but may be an oversimplification for what our users need to do. The implementation is in Erlang. Map-reduce views are available (and can be written in JavaScript, yikes!) They are also working on a JavaScript port of the main engine. Why? I'll tell you why!
CouchDB is designed to run everywhere -- instead of accessing data on a remote server, you just access it locally --on your phone, mac, or toaster. All the users of the application maintain their own datastore which replicated very cheaply via HTTP. Super low-latency with offline access. Again, I don't know if this will work (J. Chris said that his approach to replication might be a bit simplistic), but it's a neat idea.
Meebo has production Couch serving deployment called Lounge. I forget what they use it for.
If any of this sounds interesting, I recommend J. Chris's blog -- it explains where Couch is coming from/going to much better than I could.
Lightning Talks
After CouchDB, we had a few more quick talks. I'll list these with a few comments:
*Vpork: performance tester for key-values stores. Something that would be nice to see more of.
*MongoDB: As someone
who has done their share of database demos, I can tell you that the MongoDB demo wasn't super interesting. However, the database itself is very cool -- they are trying to do what Sherpa is doing -- provide a scalable system with as many querying
capabilities as possible.
*Google Megastore: Building secondary indexes on Bigtable. The presenter said that "write-read" consistency was actually really important for web applications. (Consider the case of writing a status update, and not being able to read it.) Dynamo does not provide this level of consistency and megastore provided consistency across Bigtable entity groups.
I also recommend the #nosql tag on twitter. You'll find links to videos from the event and ongoing conversation.
Toby Negrin
Sherpa Product Manager
Posted at June 19, 2009 11:28 AM | Permalink
Comments
One correction: Rackspace is sponsoring Cassandra development, not Rackable. :)
Good notes!
Posted by: Jonathan Ellis at June 19, 2009 9:19 PM
Nice write up. I just wanted to mention that the Voldemort integration with Hadoop is quite real, not just a plan. I didn't have time to go into a lot of details in the presentation. You can read about it here:
Posted by: Jay Kreps at June 19, 2009 10:12 PM
Videos, slides etc are up here: http://blog.oskarsson.nu/2009/06/nosql-debrief.html
Also worth joining this mailing list if you want to discuss these technologies in general: http://groups.google.com/group/nosql-discussion
/Johan
Posted by: Johan Oskarsson at June 20, 2009 2:34 AM
Hey Toby,
thanks for sharing your notes.
Cheers
Jan
--
Posted by: Jan at June 20, 2009 3:02 AM
Jonathan, Fixed it, thank you!
Posted by: Havi Hoffman at June 20, 2009 8:56 AM
Hi! Great post, thank you for sharing!
Only just a question: what about Tokyo Cabinet / Tyrant? There wasn't any person related with that project?
Posted by: Fernando Blat at July 4, 2009 9:03 AM
wade very carefully into alternative datastores
a datastore is worthless if it is unavailable or becomes corrupt...many of the technologies listed above are years away from being bulletproof. keep in mind that most of the firms involved in the above efforts continue to keep traditional rdbm systems in their critical path with no immediate plans to move away from them
most of the nosql people also seem to be ignoring drizzle. the whole point of drizzle is to meet the needs of the web stack by removing features that tend to complicate the model and reduce performance
most users can get where they need to be by modifying their use of an rdbms, not replacing it. simplify schemas. denormalize. don't join if you can help it. get faster boxes. load up the boxes with fast ram. use ssds. use memcache. my guess is that those steps would address most needs. surely people will scream when i say "get faster boxes"...i defy anyone to build a case for build vs buy. i can buy three insane boxes for 50k. it would five times that for most people to realistically rip out mysql
Posted by: Abe Bee at July 20, 2009 8:51 AM
@Abe Bee - While I agree that these products are all immature, most of the ones listed are solving real problems at top-ranked websites right now. They are getting better at an exponential rate, so they are only a year or two away from being "enterprise ready".
Databases work well at small scales, but have problems scaling. Everyone goes thru the same routine: First you denormalize, then you add a caching layer (like memcached), do replication, start sharding etc. These all take a *LOT* of work. If you fully de-normalize, you get a KV store. If you use memcached, you're already dealing with a KV store. If you replicate, you either loose high-availability (sync) or consistency (async). If you shard, your application gets complex fast. On top of all this, and you loose the ability to do joins, to have transactions, etc. Quite often, you loose the ability to query your data (in a reasonable time) because of sheer size. You think "Why are we using a database again?"
Instead, these next-gen databases start off being fully denormalized. Some have fast performance (5ms vs 200ms). Some scale to hundreds of nodes. Some can work during network partitions (very handy if you have 100's of nodes). Some do sharding (and resharding during failures) internally, instead of making it the application's problem.
"i can buy three insane boxes for 50k."
Agreed. That will solve some subset of problems, but not the problems of the top websites. LinkedIn, Facebook, etc all passed that bar years ago. And we all know that the "top websites" of today will be "ordinary websites" tomorrow. (Remember when the internet had 10 million documents and only a few special search engines could handle that much data? Now those 10 million documents could fit on your 3 insane boxes.)
Posted by: Anonymouse at August 28, 2009 8:20 AM
Post a comment
Comment Policy: We encourage comments and look forward to hearing from you. Please note that Yahoo! may, in our sole discretion, remove comments if they are off topic, inappropriate, or otherwise violate our Terms of Service. Fields marked with asterisk '*' are required.
Subscribe
Recent Blog Articles
view all
YQL Open Table for Google Buzz now live
Tue, 09 Feb 2010
INSERT INTO twitter.status ...
Mon, 08 Feb 2010
Announcing the Yahoo! Brasil Open Hack Day 2010, 20-21 March
Mon, 08 Feb 2010
Marketing hacks, linchpins, and tech women of valor
Sun, 07 Feb 2010
Yahoo! India invites you to join the first India Hadoop Summit
Thu, 04 Feb 2010
Recent Links
Appcelerator Titanium + Yahoo YQL on Vimeo
Mon, 08 Feb 2010
Tue, 02 Feb 2010
PhoneGap | Cross platform mobile framework
Sat, 30 Jan 2010
Web developers can rule the iPad - O'Reilly Radar
Sat, 30 Jan 2010
rc3.org - Is the iPad the harbinger of doom for personal computing?
Thu, 28 Jan 2010
Archives
2010
2009
2008
2007
2006
2005
Recent Readers

