Tags: python, json, dictionary, bigdata, pickle

Fast way to save/load giant Python dictionaries (~50GB)?


I have a bunch of data I need to store in memory as a Python dictionary for O(1) fast lookups (millions of key/value pairs). I've been using the json library, which saves the data to a 50GB JSON file that takes 45min to load every time I start my program. There's gotta be a faster way...

The keys are Int64s (e.g. 1101101000011011011010000110010101011101000101010101011101000101). The bits are meaningful, so I can't use different keys.

The values are lists of 3 Float32 lists (e.g. [[-34.83263, 46.90836, 66.2], [14.23263, -76.90836, 310.4], ... ]). The outer lists are of variable length. The inner lists always have 3 floats.

It doesn't need to be saved as human-readable, so I've looked into using pickle instead. Is pickle going to be best for this or is there something more efficient? Are there better ways to store this much data and load it quickly into a python dictionary when I run my program?


Solution

  • The simplest solution is probably to switch to a key-value database instead of loading the entire dictionary on every start. Python provides those in the dbm module. Using shelve would be even easier but I find the pickle overhead significant in this use case.

    dbm uses str or bytes for keys and bytes for values, so we have to adapt to that. The key is simple enough: we convert it with hex(key). str(key) would also work, but the higher information density of hex is preferable.

    For the value, using lists of lists is very wasteful. Numpy arrays are faster and can be converted to and from raw bytes. A more universal serialization may want to include some headers for the shape and datatype but here, since we know the format, we can keep things very simple.

    Converting to storage: value = np.asarray(value).tobytes(). Converting from storage: value = np.frombuffer(value).reshape((-1, 3)).
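
    A quick sketch of that round trip, just for illustration (the key and value here are made up). Note that np.asarray on plain Python floats produces float64 and np.frombuffer also assumes float64 by default, so the two calls above match; if you would rather store float32 to halve the size, the dtype has to be given on both sides:

    import numpy as np

    key = 0b1101101000011011  # any int key; hex() gives a compact string form
    value = [[-34.83263, 46.90836, 66.2], [14.23263, -76.90836, 310.4]]

    # Default: float64 on both sides, matching the conversions above.
    stored = np.asarray(value).tobytes()
    restored = np.frombuffer(stored).reshape((-1, 3))

    # Optional: explicit float32 halves the stored size, but then the dtype
    # must also be passed to frombuffer.
    stored32 = np.asarray(value, dtype=np.float32).tobytes()
    restored32 = np.frombuffer(stored32, dtype=np.float32).reshape((-1, 3))

    print(hex(key), restored.shape, restored32.dtype)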

    With this concept in hand, we can wrap the dbm interface in a class that behaves roughly like a dictionary.

    import dbm

    import numpy as np


    class DbDict:
        """Dictionary-like key-value store
    
        Maps int keys to 2D arrays of floats with N rows and 3 columns
    
        Attributes:
        dbhandle -- dbm handle
        """
        def __init__(self, dbhandle):
            """Takes ownership of the given handle"""
            self.dbhandle = dbhandle
    
        @classmethod
        def open(cls, filepath, mode='r'):
            """Opens a database with the given mode"""
            return cls(dbm.open(filepath, mode))
    
        @classmethod
        def create(cls, filepath):
            """Creates a new empty database"""
            return cls(dbm.open(filepath, 'n'))
    
        def __enter__(self):
            """Implements context manager to close database"""
            return self
    
        def __exit__(self, type_, value, traceback):
            """Implements context manager to close database"""
            self.dbhandle.close()
    
        def __len__(self):
            """Number of keys"""
            return len(self.dbhandle)
    
        def __getitem__(self, key):
            """Retrieves key"""
            return np.frombuffer(self.dbhandle[hex(key)]).reshape((-1, 3))
    
        def __setitem__(self, key, value):
            """Overwrites key"""
            self.dbhandle[hex(key)] = np.asarray(value).tobytes()
    
        def __delitem__(self, key):
            """Removes key"""
            del self.dbhandle[hex(key)]
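
    For illustration, this is how the class might be used on its own (the file name is arbitrary):

    with DbDict.create("points.db") as db:
        db[0b1101101000011011] = [[-34.83263, 46.90836, 66.2],
                                  [14.23263, -76.90836, 310.4]]

    with DbDict.open("points.db") as db:
        print(len(db))                   # -> 1
        print(db[0b1101101000011011])    # -> a (2, 3) float array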
    

    To test this, we have to define some test data. I simply create a dictionary mapping 10 million consecutive integers to numpy arrays shaped (1, 3). I don't think using random integers will have a significant effect on performance in this particular case. Using just 3 floats per key gives the highest key-to-value ratio and thereby tests the worst case.

    def gen_data(count=10_000_000):
        values = np.random.random((count, 1, 3))
        return dict(enumerate(values))
    

    This consumes about 2 GiB of memory. If we want to store it as JSON, we have to convert the numpy arrays into plain Python lists first. This doubles the memory usage. Things would be worse if we used more values per key.

    def to_simple_format(data):
        return {key: [[float(item) for item in row]
                      for row in value]
                for key, value in data.items()}
    

    Now a simple test for storing and loading:

    import json


    def to_json(data, filepath):
        with open(filepath, 'w') as outfile:
            json.dump(data, outfile)
    
    
    def from_json(filepath):
        with open(filepath, 'r') as infile:
            data = json.load(infile)
        # json always converts keys to strings. We have to revert that.
        return {int(key): value for key, value in data.items()}
    
    
    def to_compact_format(data):
        asarray = np.asarray
        return {key: asarray(value) for key, value in data.items()}
    
    
    def lookup_dict(data, count=20_000_000):
        keys = np.random.randint(0, len(data), count)
        for key in keys:
            data[key]
    
    
    def main():
        data = gen_data()
        data = to_simple_format(data)
        to_json(data, "foo.json")
        data = from_json("foo.json")
        data = to_compact_format(data)
        lookup_dict(data)
    

    Storing takes 1:15 minutes on my system. Loading and doing 20 million random lookups takes 33 seconds. Of those, 25 seconds are just loading and converting to the numpy format.

    This is the same routine using the dbm interface:

    def to_dbm(data, filepath):
        with DbDict.create(filepath) as db:
            for key, value in data.items():
                db[key] = value
    
    
    def main():
        data = gen_data()
        to_dbm(data, "foo.db")
        with DbDict.open("foo.db") as data:
            lookup_dict(data)
    

    Storing takes about the same time, but it now spends most of that time doing tiny I/O operations instead of float-to-string conversions, so maybe there is a way to optimize this (one idea is sketched below). The file itself is slightly larger.
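
    One thing that may be worth trying, assuming the gdbm backend (dbm.gnu) is available on your platform, is its fast mode: the 'f' flag tells gdbm not to synchronize after every write, which can speed up bulk inserts, and a single sync() at the end flushes everything. This is only a sketch, not something I benchmarked here:

    import dbm.gnu

    import numpy as np


    def to_dbm_fast(data, filepath):
        # 'n' creates a new, empty database; 'f' enables gdbm's fast mode
        # (writes are not synchronized to disk after every operation).
        with dbm.gnu.open(filepath, 'nf') as db:
            for key, value in data.items():
                db[hex(key)] = np.asarray(value).tobytes()
            db.sync()  # flush once at the end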

    Lookup is faster overall at 22 seconds. However, we have now moved all the cost from loading into the individual lookups. We can never be faster than a plain dict lookup, so as the number of lookups increases, performance will eventually fall behind the original design (though a small in-memory cache in front of the database can help; see the sketch after this list). On the other hand, you gain several advantages:

    • The size of the database can grow bigger than your memory (dbm does some caching but this should be capped at a reasonable size)
    • There is no significant initial loading cost. Perfect if you only look up a fraction of the keys
    • You can overwrite values without writing the whole thing again
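
    If your workload hits the same keys repeatedly, a small in-memory cache in front of the database can recover much of the per-lookup cost. This is not part of the design above, just one possible refinement using the standard functools.lru_cache (the cache size is an arbitrary example value, and "foo.db" is the database created in the test):

    from functools import lru_cache


    def cached_reader(db, maxsize=100_000):
        """Returns a lookup function that memoizes recently used keys."""
        @lru_cache(maxsize=maxsize)
        def lookup(key):
            return db[key]
        return lookup


    with DbDict.open("foo.db") as db:
        lookup = cached_reader(db)
        lookup(42)  # reads from the database
        lookup(42)  # served from the in-memory cache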