python serialization pickle inverted-index

Using cPickle to serialize a large dictionary causes MemoryError


I'm writing an inverted index for a search engine on a collection of documents. Right now, I'm storing the index as a dictionary of dictionaries. That is, each keyword maps to a dictionary of docIDs->positions of occurrence.

The data model looks something like: {word : { doc_name : [location_list] } }
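For illustration only, here is a minimal sketch of how a structure of that shape might be built. The `build_index` helper, the whitespace tokenization, and the sample documents are assumptions made for the example, not the actual indexing code.

```python
from collections import defaultdict

def build_index(docs):
    """Build {word: {doc_name: [positions]}} from a {doc_name: text} mapping.

    Hypothetical helper: tokenization is a plain lowercase whitespace split.
    """
    index = defaultdict(dict)
    for doc_name, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            # Record each position at which the word occurs in this document.
            index[word].setdefault(doc_name, []).append(position)
    # Return a plain dict of plain dicts and lists.
    return dict(index)

# Made-up sample documents.
index = build_index({
    "a.txt": "the cat sat",
    "b.txt": "the dog sat on the mat",
})
# index["the"] -> {"a.txt": [0], "b.txt": [0, 4]}
```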

Building the index in memory works fine, but when I try to serialize to disk, I hit a MemoryError. Here's my code:

import sys
import cPickle

# Write the index out to disk; the with block closes the file afterwards
with open(sys.argv[3], 'wb') as serializedIndex:
    cPickle.dump(index, serializedIndex, cPickle.HIGHEST_PROTOCOL)

Right before serialization, my program is using about 50% of memory (1.6 GB). As soon as I make the call to cPickle.dump, memory usage jumps to 80% and the program crashes.

Why is cPickle using so much memory for serialization? Is there a better way to approach this problem?


Solution

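  • cPickle needs a lot of extra memory because it does cycle detection: it keeps a memo of every object it has already serialized so that shared and self-referencing objects are written only once. If you are sure your data has no cycles, you could try the marshal module instead.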
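A minimal sketch of the marshal approach, assuming the index contains only built-in types (dicts, lists, strings, integers). The `save_index` and `load_index` names are hypothetical. Note that marshal's file format is specific to the Python version that wrote it, so this suits a cache read back by the same interpreter rather than a long-term storage format. Whether it lowers peak memory enough for your data is something to measure, not assume.

```python
import marshal

def save_index(index, path):
    """Write the index to disk with marshal, which does not track cycles."""
    with open(path, 'wb') as f:
        marshal.dump(index, f)

def load_index(path):
    """Read back an index written by save_index."""
    with open(path, 'rb') as f:
        return marshal.load(f)
```

In the question's code this would replace the cPickle.dump call, for example `save_index(index, sys.argv[3])`.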