I'm running 64-bit Python 3 on Linux, and I have code that generates lists with about 20,000 elements. A MemoryError occurred when my code tried to write a list of ~20,000 2D arrays to a binary file via the pickle module, even though it had generated all of these arrays and appended them to the list without a problem. I know this must take up a lot of memory, but the machine I'm using has about 100GB available (according to free -m). The line with the error:
with open('all_data.data', 'wb') as f:
    pickle.dump(data, f)
>>> MemoryError
where data is my list of ~20,000 numpy arrays. Also, I previously tried to run this code with about 55,000 elements, but when it was 40% of the way through appending the arrays to the data list, the process just output Killed and died. So now I'm trying to break the work into segments, but this time I get a MemoryError. How can I get around this? I was also told that I have access to multiple CPUs, but I have no idea how to take advantage of them (I don't yet understand multiprocessing).
Pickle has to traverse all of your data, and it will likely build intermediate state in memory before everything reaches the disk - so if your data already occupies about half of your available memory, the dump will blow up.
Since your data is already in a list, an easy workaround is to pickle each array individually and append it to the file, instead of trying to serialize all ~20,000 arrays in a single go:
import pickle

with open('all_data.data', 'wb') as f:
    for item in data:
        pickle.dump(item, f)
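Incidentally, the Killed you saw at 55,000 elements suggests that building the full list itself was exhausting memory. If that happens again, you can take this one step further and dump each array as soon as it is generated, so only one array is ever held in memory at a time. A minimal sketch, where generate_array is a hypothetical stand-in for whatever produces your 2D arrays:

import pickle
import numpy as np

def generate_array(i):
    # Hypothetical placeholder for your real array-producing code.
    return np.random.rand(100, 100)

with open('all_data.data', 'wb') as f:
    for i in range(55000):
        # Dump each array immediately instead of accumulating a list.
        pickle.dump(generate_array(i), f)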
Then, to read it back, just keep unpickling objects from the file and appending them to a list until the file is exhausted:
data = []
with open('all_data.data', 'rb') as f:
    while True:
        try:
            data.append(pickle.load(f))
        except EOFError:
            # pickle.load raises EOFError at the end of the file.
            break
This works because unpickling from a file is quite well behaved: the file pointer is left exactly where a pickled object ends, so each subsequent read starts at the beginning of the next object.
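The same property also means you don't have to rebuild the full list at all if you don't need every array in memory at once: wrap the loop in a generator and handle the arrays one at a time. A sketch along those lines, where process is a hypothetical stand-in for your per-array work:

import pickle

def load_arrays(path):
    # Yield unpickled objects one by one until the file is exhausted.
    with open(path, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

for array in load_arrays('all_data.data'):
    process(array)  # hypothetical per-array processing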