Search code examples
pythonpython-3.xnosql

Efficient way to set up a persistent read-only dictionary with many keys but simple values


I have roughly 100k (fairly short) unicode strings and for each of them a corresponding integer.

I need to store them in a dicionary-like persistent object that will be then accessed only for reading.

I'm looking for a solution that doesn't fill the ram loading the whole structure and not too much greedy of space on disk.

I already tried dbm and shelve, but I got a resulting file of 30Mb+.

I'm sure there are tools which fits better on this specific case, so any pointers (for both python2 or python3) are welcome.


Solution

  • Index
    You may want to build an index.

    The database is a file with each string in a line(or other dividing character)and the number at the end of the line. The strings are sorted. You can do binary search. You can use merge sort to build the structure. Merge two index files into one.

    The cost:

    O(n*log(n)) writes of each mapping during the creation
    O(log(n)) lookups during search (lots of searches take long on a spinning disk)
    O(max_string_length) memory usage

    We built such a thing in Java for a Search Engine Seminar. It can grow to gigabytes and still answer fast. In this context it is called an inverted index.

    directory structure
    You can create a directory for the string and in this directory a string with the number as its name. If you are lucky, it will not take 4kb for each entry. I do not know.