
Most space efficient way to store millions of simple data?


My data looks like this:

00000000001 : '12341234...12341234'

Basically, each record is a unique ID associated with a long string of digits (fewer than 100 characters).

I want to store tens of millions, and maybe even hundreds of millions, of these records: just IDs pointing to long digit strings. I am wondering what the most space-efficient way to store them is, while also keeping lookups fast. I want my application to be given a number like 550,000 and quickly retrieve the digit string associated with it.

I have looked at open source DBs as an option (MySQL) and I also considered something like JSON or XML. Are there other options? What would be best?

I am uncertain because the data is so simple. I am wary of using a database because some are relational or object-oriented, and I don't need those features (they might add overhead). I am also afraid my data is too simple and repetitive for something like JSON, because much of the file space would be consumed by repeating "id" : and "bignumber" : over and over.

Any suggestions?


Solution

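  • It looks like both the ID and the value are integers, so storing them as binary data (as opposed to strings) would save a lot of space: a number of up to 100 decimal digits fits in 42 bytes as a binary integer, versus one byte per digit as ASCII text. This rules out JSON and XML, which are text-based.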

    I think you want to use a key-value store, such as BerkeleyDB. Such stores allow fast lookup by key, but no other kind of query.

    Something like SQLite would also add very little overhead, and it offers convenient access through SQL.

    It is also important that you can access the data without first reading all of it into memory. Database engines manage that for you; with JSON or a hand-rolled format, this can be a lot of work.

    If you do not need network access (but want to work on local files), an embedded database system like BerkeleyDB or SQLite seems to be the best fit. Not having a server also greatly reduces the setup overhead.
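
As a rough illustration of the suggestions above, here is a minimal sketch using Python's built-in sqlite3 module. The table name `numbers`, the helper functions, and the one-byte length prefix (used so that leading zeros survive the round trip) are all assumptions made for this example, not something the question specifies.

```python
import sqlite3


def pack_digits(digits: str) -> bytes:
    """Encode a decimal digit string (< 256 chars) as compact binary.

    Layout (an assumption of this sketch): one byte holding the digit
    count, followed by the number as a big-endian unsigned integer.
    """
    n = int(digits)
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([len(digits)]) + body


def unpack_digits(blob: bytes) -> str:
    """Reverse pack_digits, restoring any leading zeros."""
    width = blob[0]
    return str(int.from_bytes(blob[1:], "big")).zfill(width)


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store; pass a file path to persist it."""
    conn = sqlite3.connect(path)
    # INTEGER PRIMARY KEY makes id the table's row key, so lookups by id
    # need no separate index.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS numbers ("
        "id INTEGER PRIMARY KEY, value BLOB NOT NULL)"
    )
    return conn


def put(conn: sqlite3.Connection, key: int, digits: str) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO numbers (id, value) VALUES (?, ?)",
        (key, pack_digits(digits)),
    )


def get(conn: sqlite3.Connection, key: int):
    """Return the digit string stored under key, or None if absent."""
    row = conn.execute(
        "SELECT value FROM numbers WHERE id = ?", (key,)
    ).fetchone()
    return None if row is None else unpack_digits(row[0])


conn = open_store()
put(conn, 550000, "1234" * 24)
conn.commit()
print(get(conn, 550000))
```

When loading millions of rows, wrapping the inserts in a single transaction (or using `executemany`) is usually much faster than committing after each row.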