What scalability problems have you encountered using a NoSQL data store?


NoSQL refers to non-relational data stores that break with the history of relational databases and ACID guarantees. Popular open source NoSQL data stores include:

  • Cassandra (tabular, written in Java, used by Cisco, WebEx, Digg, Facebook, IBM, Mahalo, Rackspace, Reddit and Twitter)
  • CouchDB (document, written in Erlang, used by BBC and Engine Yard)
  • Dynomite (key-value, written in Erlang, used by Powerset)
  • HBase (key-value, written in Java, used by Bing)
  • Hypertable (tabular, written in C++, used by Baidu)
  • Kai (key-value, written in Erlang)
  • MemcacheDB (key-value, written in C, used by Reddit)
  • MongoDB (document, written in C++, used by Electronic Arts, Github, NY Times and Sourceforge)
  • Neo4j (graph, written in Java, used by some Swedish universities)
  • Project Voldemort (key-value, written in Java, used by LinkedIn)
  • Redis (key-value, written in C, used by Craigslist, Engine Yard and Github)
  • Riak (key-value, written in Erlang, used by Comcast and Mochi Media)
  • Ringo (key-value, written in Erlang, used by Nokia)
  • Scalaris (key-value, written in Erlang, used by OnScale)
  • Terrastore (document, written in Java)
  • ThruDB (document, written in C++, used by JunkDepot.com)
  • Tokyo Cabinet/Tokyo Tyrant (key-value, written in C, used by Mixi.jp (Japanese social networking site))

I'd like to know about specific problems you - the SO reader - have solved using data stores and what NoSQL data store you used.

Questions:

  • What scalability problems have you used NoSQL data stores to solve?
  • What NoSQL data store did you use?
  • What database did you use prior to switching to a NoSQL data store?

I'm looking for first-hand experiences, so please do not answer unless you have that.


Solution

  • I switched a small subproject from MySQL to CouchDB so that it could handle the load. The result was amazing.

    About two years ago, we released our self-written software on http://www.ubuntuusers.de/ (probably the biggest German Linux community website). The site is written in Python, and we added a WSGI middleware that caught all exceptions and sent them to another small MySQL-powered website. This small website used a hash to tell different bugs apart and stored the number of occurrences and the time of the last occurrence for each.

    Unfortunately, shortly after the release, the traceback-logger website stopped responding. We had locking issues with the production database of our main site, which threw exceptions on nearly every request, as well as several other bugs that we had not discovered during testing. The server cluster of our main site called the traceback-logger submit page several thousand times per second. That was far too much for the small server hosting the traceback logger (it was an old server that was otherwise only used for development).

    At that time CouchDB was rather popular, so I decided to try it out and write a small traceback logger with it. The new logger consisted of a single Python file, which provided a submit page and a bug list with sorting and filter options, with a CouchDB process running in the background. The new software responded extremely quickly to all requests, and we were able to view the massive number of automatic bug reports.

    Interestingly, the previous solution ran on an old dedicated server, whereas the new CouchDB-based site ran only on a shared Xen instance with very limited resources. And I haven't even used the strength of key-value stores to scale horizontally. The ability of CouchDB / Erlang OTP to handle concurrent requests without locking anything was already enough to serve our needs.

    The quickly written CouchDB traceback logger is still running today and is a helpful way to explore bugs on the main website. About once a month, however, the database becomes too big and the CouchDB process gets killed. The compact-db command of CouchDB then reduces the size from several GB to a few KB, and the database is up and running again (maybe I should consider adding a cronjob there... 0o).

    In summary, CouchDB was surely the best choice (or at least a better choice than MySQL) for this subproject, and it does its job well.
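
For readers who want to picture the exception-catching middleware and the hash-based grouping described in this answer, here is a minimal sketch. It is not the ubuntuusers.de code: the names `TracebackCounter` and `ExceptionLoggingMiddleware` are made up for illustration, SHA-1 over the formatted traceback is an assumed choice of hash, and an in-memory dict stands in for the remote logger site and its database.

```python
import hashlib
import traceback
from datetime import datetime, timezone


class TracebackCounter:
    """Hypothetical in-memory stand-in for the logger's storage backend."""

    def __init__(self):
        self.bugs = {}

    def record(self, tb_text):
        # Identical tracebacks hash to the same key, so a repeated bug
        # only bumps a counter instead of creating a new entry.
        key = hashlib.sha1(tb_text.encode("utf-8")).hexdigest()
        entry = self.bugs.setdefault(
            key, {"traceback": tb_text, "count": 0, "last_seen": None}
        )
        entry["count"] += 1
        entry["last_seen"] = datetime.now(timezone.utc)
        return key


class ExceptionLoggingMiddleware:
    """WSGI middleware that records any exception raised by the wrapped app."""

    def __init__(self, app, store):
        self.app = app
        self.store = store

    def __call__(self, environ, start_response):
        try:
            return self.app(environ, start_response)
        except Exception:
            # Record the formatted traceback, then re-raise so the server
            # still handles the failure as usual. Errors raised later, while
            # the response body is being iterated, are not caught here.
            self.store.record(traceback.format_exc())
            raise
```

Wrapping an application would look like `app = ExceptionLoggingMiddleware(app, TracebackCounter())`. In a setup like the one described above, `record` would submit the report to the separate logger site rather than keep it in process memory.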
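
As for the cronjob mentioned above: CouchDB exposes compaction over HTTP as `POST /{db}/_compact` with a JSON content type, so a small script run on a schedule could trigger it. The sketch below is only an assumption of how that might look. The base URL and database name are placeholders, and authentication is left out even though newer CouchDB versions require admin credentials for this call.

```python
import urllib.parse
import urllib.request


def build_compact_request(base_url, db_name):
    """Build the POST request that asks CouchDB to compact one database."""
    # base_url and db_name are supplied by the caller; nothing here is
    # taken from the original setup.
    url = f"{base_url.rstrip('/')}/{urllib.parse.quote(db_name, safe='')}/_compact"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def compact(base_url, db_name, timeout=10):
    """Send the compaction request and return the HTTP status code."""
    request = build_compact_request(base_url, db_name)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A cron entry could then call something like `compact("http://localhost:5984", "tracebacks")`, where both arguments are hypothetical. Compaction runs in the background on the server, so the call returns before the file has actually shrunk.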