Tags: python, memory, set, compression, trie

RAM-efficient set of 1 billion short strings, with fast lookup


I want to be able to test, as fast as possible, whether a given string is a member of a set S of 1 billion strings:

  • each string is 10 characters long on average (the maximum length is 100), so the whole data is ~ 10 GB

  • 99% of the strings use only a-z, A-Z, 0-9 and a few special chars (.!-;\/); in some rare cases there are other chars

  • if necessary, the data could be sorted:

    aaaaa
    ...
    helloworld
    hellu
    help
    ...
    zzzzz
    

    and we could maybe use a prefix tree / trie (a small sketch follows this list), but I'm not sure it's the best solution here.
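
    For illustration, here is a minimal dict-based trie sketch (not necessarily the right tool here; in CPython every node is a dict, so for 10^9 strings this takes far more RAM than the raw data):

    END = ''  # marker key: "a complete string ends at this node"

    def trie_add(root, s):
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = True

    def trie_contains(root, s):
        node = root
        for ch in s:
            node = node.get(ch)
            if node is None:
                return False
        return END in node

    root = {}
    for word in ('helloworld', 'hellu', 'help'):
        trie_add(root, word)
    print(trie_contains(root, 'help'))   # True
    print(trie_contains(root, 'hello'))  # False: 'hello' is only a prefix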

Question: Is there a way to work with an in-memory compressed set in Python, possibly using only the standard library, that still allows fast lookup?

The goal is to test 'hello' in S as fast as possible.

I know how to do this with 10 GB of RAM, but out of curiosity, I'd like to know if this is possible with, say, only 1 or 2 GB of RAM:

S = set()
with open('db.txt', 'r') as f:
    for line in f:
        S.add(line.rstrip('\n'))  # strip the trailing newline, otherwise 'hello' in S is always False
print('hello' in S)

Solution

  • No.

    Unless there is some massive redundancy in the data that you're not telling us about, you can expect to compress the 10 GB down to maybe 6 or 7 GB, even sorted. A data structure like the one you suggest would make it massively larger, not smaller, because of the per-node overhead.
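
    A quick way to sanity-check that estimate on your own data (a sketch only, using the standard-library zlib on a sample of db.txt from the question):

    import zlib

    with open('db.txt', 'rb') as f:
        sample = f.read(100 * 1024 * 1024)  # first 100 MB as a sample
    ratio = len(zlib.compress(sample, 9)) / len(sample)
    print(ratio)  # roughly 0.6-0.7 would match the 6-7 GB estimate above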

    As suggested in another answer, you can make lookups fast with a hashing approach, but that will not reduce the storage by a factor of five to ten. A rough sketch follows.
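
    One possible hashing approach (my illustration, not the other answer's code): store a 64-bit hash of every string in a sorted array and binary-search it. That is about 8 bytes per entry (~8 GB for 10^9 strings), so lookups are fast but the memory saving is modest, hash collisions can give false positives, and sorting 10^9 hashes in-process like this is itself memory-hungry; it only sketches the idea.

    import hashlib
    from array import array
    from bisect import bisect_left

    def h64(s):
        # stable 64-bit hash (assumption: blake2b truncated to 8 bytes is acceptable)
        return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), 'big')

    hashes = array('Q')
    with open('db.txt', 'r') as f:
        for line in f:
            hashes.append(h64(line.rstrip('\n')))
    hashes = array('Q', sorted(hashes))  # sort once so lookups can use bisect

    def probably_in(s):
        x = h64(s)
        i = bisect_left(hashes, x)
        return i < len(hashes) and hashes[i] == x

    print(probably_in('hello'))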