Tags: java, hadoop, mapreduce, elastic-map-reduce, emr

Best way to have fast-access key-value storage for a huge dataset (5 GB)


There is a dataset of ~5 GB. It contains one key-value pair per line, and values need to be looked up by key on the order of a billion times.

I have already tried the disk-based approach of MapDB, but it throws a ConcurrentModificationException and doesn't seem mature enough to be used in a production environment yet.

I also don't want to put it in a database and make that call a billion times (though a certain level of in-memory caching could be done here).

Basically, I need to access this key-value dataset in the mapper/reducer of a Hadoop job step.


Solution

  • After trying out a bunch of things, we are now using SQLite.

    Following is what we did:

    1. Load all the key-value data into a pre-defined database file, indexed on the key column (this increased the file size, but it was worth it; see the loader sketch after this list).
    2. Store this file (key-value.db) in S3.
    3. This file is then passed to the Hadoop jobs via the distributed cache.
    4. In the configure/setup method of the Mapper/Reducer, open a connection to the db file (it takes around 50 ms); see the mapper sketch below.
    5. In the map/reduce method, query this db with the key (the lookup time was so negligible we didn't even need to profile it).
    6. Close the connection in the cleanup method of the Mapper/Reducer.
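
    The original loader code isn't shown; as a rough sketch of step 1, something along these lines could build and index the SQLite file with plain JDBC. The file name `key-value.db`, the table/column names `kv(k, v)`, and the tab-separated input format are assumptions, and the sqlite-jdbc driver is assumed to be on the classpath:

    ```java
    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class KeyValueDbLoader {
        public static void main(String[] args) throws Exception {
            // args[0]: path to a text file with one "key<TAB>value" pair per line (assumed format).
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite:key-value.db")) {
                conn.setAutoCommit(false);
                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT, v TEXT)");
                }
                try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]));
                     PreparedStatement insert =
                         conn.prepareStatement("INSERT INTO kv (k, v) VALUES (?, ?)")) {
                    String line;
                    long count = 0;
                    while ((line = reader.readLine()) != null) {
                        String[] parts = line.split("\t", 2);
                        insert.setString(1, parts[0]);
                        insert.setString(2, parts.length > 1 ? parts[1] : "");
                        insert.addBatch();
                        // Flush batches periodically so a 5 GB input doesn't sit in memory.
                        if (++count % 100_000 == 0) {
                            insert.executeBatch();
                        }
                    }
                    insert.executeBatch();
                }
                // Index the key column last; this is what grows the file size but makes lookups fast.
                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE INDEX IF NOT EXISTS idx_kv_k ON kv (k)");
                }
                conn.commit();
            }
        }
    }
    ```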
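
    And a minimal sketch of steps 4-6, assuming the newer `org.apache.hadoop.mapreduce` API, that `key-value.db` is symlinked from the distributed cache into the task's working directory, and that the sqlite-jdbc jar is shipped with the job; the table and column names match the loader sketch above:

    ```java
    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class KeyValueLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

        private Connection conn;
        private PreparedStatement lookup;

        @Override
        protected void setup(Context context) throws IOException {
            try {
                // Open the SQLite file shipped via the distributed cache (assumed file name).
                conn = DriverManager.getConnection("jdbc:sqlite:key-value.db");
                lookup = conn.prepareStatement("SELECT v FROM kv WHERE k = ?");
            } catch (SQLException e) {
                throw new IOException("Could not open key-value.db", e);
            }
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumes each input line is a key to look up.
            String key = line.toString().trim();
            try {
                lookup.setString(1, key);
                try (ResultSet rs = lookup.executeQuery()) {
                    if (rs.next()) {
                        context.write(new Text(key), new Text(rs.getString(1)));
                    }
                }
            } catch (SQLException e) {
                throw new IOException("Lookup failed for key " + key, e);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException {
            // Close the statement and connection once per task, not per record.
            try {
                if (lookup != null) lookup.close();
                if (conn != null) conn.close();
            } catch (SQLException e) {
                throw new IOException("Could not close key-value.db connection", e);
            }
        }
    }
    ```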