Tags: python, dictionary, bigdata, pickle, large-data

How to write to Python dictionary without loading the dictionary into memory?


I have a large table that I want to convert to a Python dictionary but I don't want to load all of the data into memory.

Is it possible to actively write to a pickle dump without building the object first?

For example:

import gzip
f_out = gzip.open("output.dict.pkl.gz", "wb")

with open("table.tsv", "r") as f_in:
    for line in f_in:
        line = line.strip()
        if line:
            fields = line.split("\t")
            k = fields[3]
            v = fields[1]

            # Pseudocode
            f_out[k] = v # I know this won't work but just so you can see my goal

# Close the pickle file
f_out.close()

Solution

  • Since your keys are strings, you can use the shelve module to make a dict-like object that's backed by a minimalist on-disk database, where the keys are strings and each value is pickled individually. You should also use the csv module to parse the TSV data properly:

    import csv
    import shelve
    
    with open("table.tsv", newline="") as f_in, shelve.open("output.db") as shelf:
        for row in csv.reader(f_in, delimiter='\t'):
            if row:
                k = row[3]  # key from the 4th column
                v = row[1]  # value from the 2nd column
                shelf[k] = v  # pickles the value and writes it to disk right away
    

    Importantly, this also means that when you open the shelf later to read just a handful of keys, you don't need to load the whole thing into memory either.
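
    For example, here is a minimal sketch of reading the shelf back later (assuming it was saved as "output.db" as above; "some_key" is just a placeholder, not a real key from your data):

    import shelve
    from itertools import islice
    
    # Opening the shelf does not load the stored values; each lookup
    # unpickles only the value for that key.
    with shelve.open("output.db", flag="r") as shelf:
        print(shelf.get("some_key"))     # fetch a single value by key
        for k in islice(shelf, 5):       # peek at the first few entries
            print(k, shelf[k])

    Passing flag="r" opens the database read-only, so a lookup script can't accidentally modify it.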