I have a large table that I want to convert to a Python dictionary but I don't want to load all of the data into memory.
Is it possible to actively write to a pickle dump without building the object first?
For example:
import gzip

# Goal: stream the table straight into some persisted mapping,
# without ever building the full dict in memory
f_out = gzip.open("output.dict.pkl.gz", "wb")
with open("table.tsv", "r") as f_in:
    for line in f_in:
        line = line.strip()
        if line:
            fields = line.split("\t")
            k = fields[3]
            v = fields[1]
            # Pseudocode
            f_out[k] = v  # I know this won't work, but just so you can see my goal
# Close the pickle file
f_out.close()
Since your keys are strings, you can use the shelve module to make a dict-like object that's backed by a minimalist database, where the keys are strings and the values are individually pickled objects. You should also use the csv module to parse the TSV data properly:
import csv
import shelve

with open("table.tsv", newline="") as f_in, shelve.open("output.db") as shelf:
    for row in csv.reader(f_in, delimiter="\t"):
        if row:
            k = row[3]
            v = row[1]
            shelf[k] = v  # each value is pickled and written to disk as it's assigned
Importantly, this also means that when you open it later to read a handful of keys, you don't need to load the whole thing into memory either.
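For example, here's a minimal sketch of reading it back, assuming the same output.db file written above and that a key such as "some_key" exists (the key name is just illustrative):

import shelve

# flag="r" opens the existing database read-only; only the values you
# actually look up get unpickled, everything else stays on disk.
with shelve.open("output.db", flag="r") as shelf:
    print(shelf["some_key"])   # fetch a single value by key
    print(len(shelf))          # number of stored entries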