
Riak backend choice: bitcask vs leveldb


I'm planning to use Riak as a backend for a service that stores user session data. The main key used to retrieve the data (a binary blob) is called UUID and actually is a UUID, but sometimes the data might be retrieved using one or two other keys (e.g. the user's email).

The natural option would be to pick the leveldb backend, with the possibility of using secondary indexes for such a scenario, but as secondary-index searches are not very common (around 10%–20% of lookups), I was wondering if it wouldn't be better to have a separate "indexes" bucket, where a mapping such as email->uuid would be stored.

In that scenario, when searching by a "secondary" index, I would first look up the uuid in the "indexes" bucket, and then read the data normally using the primary key.
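A minimal sketch of that two-step read path, using plain dicts to stand in for the two Riak buckets (with a real client, each dict access would be a bucket `get`; the bucket names, keys, and values below are illustrative):

```python
# Dicts stand in for the two Riak buckets in this design.
indexes = {"alice@example.com": "3f1c-uuid"}     # "indexes" bucket: email -> uuid
data = {"3f1c-uuid": b"session blob for alice"}  # "data" bucket: uuid -> session

def get_session_by_email(email):
    """Step 1: resolve email -> uuid in the indexes bucket.
    Step 2: read the session from the data bucket by primary key."""
    uuid = indexes.get(email)
    if uuid is None:
        return None
    return data.get(uuid)
```

The cost of a "secondary" lookup is simply two primary-key reads instead of one.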

Knowing that bitcask is much more predictable when it comes to latency, and possibly faster, would you recommend such a design, or should I stick with leveldb and secondary indexes?


Solution

  • I think that both scenarios would work. One way to choose between them is whether you need expiration. I guess you'll want expiration for user sessions. If that's the case, then I would go with the second scenario, as bitcask offers a very good, fully customizable expiration feature.
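    As a sketch, bitcask expiration is a single setting in the classic app.config (the one-day value here is just an example; check your Riak version's documentation for the exact knobs):

```erlang
%% app.config fragment: expire bitcask entries ~1 day after they are written
{bitcask, [
    {data_root, "/var/lib/riak/bitcask"},
    {expiry_secs, 86400}   %% 24 hours; -1 (the default) disables expiry
]}
```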

    If you go that path, you'll have to clean up the metadata bucket (in eleveldb) that you use for secondary indexes. That can be done easily by also keeping an index on the last modification time of the metadata keys. Then you run a batch job that does a 2i query to fetch old metadata and delete it. Make sure you use the latest Riak, which supports aggressive deletion and reclaiming of disk space in leveldb.
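    A sketch of that cleanup batch, with dicts standing in for the metadata bucket and its last-modified index (a real implementation would issue a 2i range query on something like a last_modified_int index via the Riak client; all names here are illustrative):

```python
import time

# Stand-ins for the eleveldb metadata bucket and its last-modified 2i index.
metadata = {}          # key -> metadata value
last_modified = {}     # key -> last-modification timestamp

def put_metadata(key, value, now=None):
    """Store the metadata and record when it was last modified."""
    metadata[key] = value
    last_modified[key] = now if now is not None else time.time()

def cleanup_older_than(cutoff):
    """Batch job: the equivalent of a 2i range query for keys modified
    before cutoff, followed by deletion of each stale entry."""
    stale = [k for k, ts in last_modified.items() if ts < cutoff]
    for k in stale:
        del metadata[k]
        del last_modified[k]
    return stale

put_metadata("alice@example.com", "uuid-1", now=1000)
put_metadata("bob@example.com", "uuid-2", now=2000)
removed = cleanup_older_than(1500)   # removes only the entry from t=1000
```

Run the batch on a schedule so the metadata bucket roughly tracks the expiry you configured in bitcask.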

    That said, maybe you can have everything in bitcask, and avoid secondary indexes altogether. Consider this data design:

    • one "data" bucket: keys are uuids, values are the sessions
    • one "mapping_email" bucket: keys are emails, values are uuids
    • one "mapping_otherstuff" bucket: same for other properties

    This works fine if:

    • most of the time you let your data expire, which means you have no bookkeeping to do
    • you don't have too many mappings, as it's cumbersome to add more
    • you are ready to properly implement a client library that manages the three buckets, for instance when creating / updating / deleting values
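    The bookkeeping that client library has to do can be sketched like this, with dicts standing in for the bitcask buckets (with a real client, each assignment or pop would be a store or delete against the named bucket; all names are illustrative):

```python
# In-memory stand-ins for two of the three bitcask buckets.
data = {}            # "data" bucket: uuid -> session blob
mapping_email = {}   # "mapping_email" bucket: email -> uuid

def create_session(uuid, email, blob):
    """Write the session first, then the mapping, so a reader that
    finds the mapping always finds the session too."""
    data[uuid] = blob
    mapping_email[email] = uuid

def delete_session(uuid, email):
    """The library must delete from every bucket to keep them in sync."""
    data.pop(uuid, None)
    mapping_email.pop(email, None)

def get_by_email(email):
    uuid = mapping_email.get(email)
    return None if uuid is None else data.get(uuid)
```

Each extra mapping bucket adds another line to every create / update / delete, which is why too many mappings get cumbersome.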

    You could start with that, because it's easier on administration, bookkeeping, batch creation (none needed), and performance (secondary-index queries can be expensive).

    Then later on, if you need it, you can add the leveldb route. Make sure you use multi_backend from the start.
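    A sketch of the classic app.config multi-backend setup (the names like bitcask_mult are arbitrary labels; individual buckets are then pointed at a backend via their backend bucket property):

```erlang
%% app.config fragment: run bitcask and eleveldb side by side
{riak_kv, [
    {storage_backend, riak_kv_multi_backend},
    {multi_backend_default, <<"bitcask_mult">>},
    {multi_backend, [
        {<<"bitcask_mult">>,  riak_kv_bitcask_backend,  []},
        {<<"eleveldb_mult">>, riak_kv_eleveldb_backend, []}
    ]}
]}
```

    Starting on multi_backend means adding the leveldb route later is a bucket-property change rather than a cluster migration.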