Tags: java, hadoop, mapreduce, elastic-map-reduce, emr

Best way to have fast-access key-value storage for a huge dataset (5 GB)


There is a dataset of ~5 GB. It contains one key-value pair per line, and values need to be looked up by key on the order of a billion times.

I have already tried the disk-based approach of MapDB, but it throws a ConcurrentModificationException and doesn't seem mature enough to be used in a production environment yet.

I also don't want to put it in a database and make that call a billion times (though a certain level of in-memory caching could be done here).

Basically, I need to access this key-value dataset in the mapper/reducer of a Hadoop job step.


Solution

  • After trying out a bunch of things, we are now using SQLite.

    Following is what we did:

    1. Load all the key-value data into a pre-defined database file, indexed on the key column (this increased the file size, but it was worth it; see the loader sketch after this list).
    2. Store this file (key-value.db) in S3.
    3. This file is then passed to the Hadoop jobs via the distributed cache.
    4. In the configure/setup method of the Mapper/Reducer, open a connection to the db file (it takes around 50 ms); see the mapper sketch below.
    5. In the map/reduce method, query this db with the key (the lookup time was so negligible we didn't even need to profile it).
    6. Close the connection in the cleanup method of the Mapper/Reducer.
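
    The original loader code isn't shown; as a rough sketch of step 1, something along these lines could build and index the SQLite file with plain JDBC. The file name `key-value.db`, the table/column names `kv(k, v)`, and the tab-separated input format are assumptions, and the sqlite-jdbc driver is assumed to be on the classpath:

    ```java
    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class KeyValueDbLoader {
        public static void main(String[] args) throws Exception {
            // args[0]: path to a text file with one "key<TAB>value" pair per line (assumed format).
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite:key-value.db")) {
                conn.setAutoCommit(false);
                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT, v TEXT)");
                }
                try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]));
                     PreparedStatement insert =
                         conn.prepareStatement("INSERT INTO kv (k, v) VALUES (?, ?)")) {
                    String line;
                    long count = 0;
                    while ((line = reader.readLine()) != null) {
                        String[] parts = line.split("\t", 2);
                        insert.setString(1, parts[0]);
                        insert.setString(2, parts.length > 1 ? parts[1] : "");
                        insert.addBatch();
                        // Flush batches periodically so a 5 GB input doesn't sit in memory.
                        if (++count % 100_000 == 0) {
                            insert.executeBatch();
                        }
                    }
                    insert.executeBatch();
                }
                // Index the key column last; this is what grows the file size but makes lookups fast.
                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE INDEX IF NOT EXISTS idx_kv_k ON kv (k)");
                }
                conn.commit();
            }
        }
    }
    ```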
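
    And a minimal sketch of steps 4-6, assuming the newer `org.apache.hadoop.mapreduce` API, that `key-value.db` is symlinked from the distributed cache into the task's working directory, and that the sqlite-jdbc jar is shipped with the job; the table and column names match the loader sketch above:

    ```java
    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class KeyValueLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

        private Connection conn;
        private PreparedStatement lookup;

        @Override
        protected void setup(Context context) throws IOException {
            try {
                // Open the SQLite file shipped via the distributed cache (assumed file name).
                conn = DriverManager.getConnection("jdbc:sqlite:key-value.db");
                lookup = conn.prepareStatement("SELECT v FROM kv WHERE k = ?");
            } catch (SQLException e) {
                throw new IOException("Could not open key-value.db", e);
            }
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumes each input line is a key to look up.
            String key = line.toString().trim();
            try {
                lookup.setString(1, key);
                try (ResultSet rs = lookup.executeQuery()) {
                    if (rs.next()) {
                        context.write(new Text(key), new Text(rs.getString(1)));
                    }
                }
            } catch (SQLException e) {
                throw new IOException("Lookup failed for key " + key, e);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException {
            // Close the statement and connection once per task, not per record.
            try {
                if (lookup != null) lookup.close();
                if (conn != null) conn.close();
            } catch (SQLException e) {
                throw new IOException("Could not close key-value.db connection", e);
            }
        }
    }
    ```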