I have roughly 100k (fairly short) unicode strings and for each of them a corresponding integer.
I need to store them in a dicionary-like persistent object that will be then accessed only for reading.
I'm looking for a solution that doesn't fill the ram loading the whole structure and not too much greedy of space on disk.
I already tried dbm and shelve, but I got a resulting file of 30Mb+.
I'm sure there are tools which fits better on this specific case, so any pointers (for both python2 or python3) are welcome.
Index
You may want to build an index.
The database is a file with each string in a line(or other dividing character)and the number at the end of the line. The strings are sorted. You can do binary search. You can use merge sort to build the structure. Merge two index files into one.
The cost:
O(n*log(n))
writes of each mapping during the creation
O(log(n))
lookups during search (lots of searches take long on a spinning disk)
O(max_string_length)
memory usage
We built such a thing in Java for a Search Engine Seminar. It can grow to gigabytes and still answer fast. In this context it is called an inverted index.
directory structure
You can create a directory for the string and in this directory a string with the number as its name. If you are lucky, it will not take 4kb for each entry. I do not know.