I can save a serialized corpus to foobar.mm, but when I try to load it I get an UnpicklingError. Loading the dictionary works fine, though. Does anyone know how to resolve this, and why it occurs?
>>> from gensim import corpora
>>> docs = ["this is a foo bar", "you are a foo"]
>>> texts = [[i for i in doc.lower().split()] for doc in docs]
>>> print texts
[['this', 'is', 'a', 'foo', 'bar'], ['you', 'are', 'a', 'foo']]
>>> dictionary = corpora.Dictionary(texts)
>>> dictionary.save('foobar.dic')
>>> print dictionary
Dictionary(7 unique tokens)
>>> corpora.Dictionary.load('foobar.dic')
<gensim.corpora.dictionary.Dictionary object at 0x329f910>
>>> corpus = [dictionary.doc2bow(text) for text in texts]
>>> corpora.MmCorpus.serialize('foobar.mm', corpus)
>>> corpus = corpora.MmCorpus.load('foobar.mm')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.8.6-py2.7.egg/gensim/utils.py", line 166, in load
    return unpickle(fname)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.8.6-py2.7.egg/gensim/utils.py", line 492, in unpickle
    return cPickle.load(open(fname, 'rb'))
cPickle.UnpicklingError: invalid load key, '%'.
See the documentation at http://radimrehurek.com/gensim/tut1.html#corpus-formats
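MmCorpus.serialize stores the corpus in MatrixMarket format, which is a plain-text format, but MmCorpus.load goes through the binary save/load interface and expects a pickle. The unpickler hits the '%' at the start of the text file, which is not a valid pickle opcode, hence the "invalid load key" error. The dictionary loads fine because Dictionary.save really does write a pickle.
To load a serialized MatrixMarket corpus, pass the filename to the constructor instead: corpus = corpora.MmCorpus('foobar.mm')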
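
To see why the error message mentions '%', here is a minimal sketch that uses only the standard library and does not call gensim at all. The file contents are a hand-written stand-in for a MatrixMarket file (the real output of MmCorpus.serialize may differ in its entries), and the helper names try_unpickle and first_line are made up for this illustration.

```python
import os
import pickle
import tempfile

# Hand-written stand-in for a MatrixMarket file: a text header line that
# starts with "%%MatrixMarket", a size line (rows, columns, non-zeros),
# then one "row column value" triple per line. The entries are
# illustrative only, not the exact output gensim would produce.
MM_TEXT = (
    "%%MatrixMarket matrix coordinate real general\n"
    "2 7 9\n"
    "1 1 1\n"
    "1 2 1\n"
    "1 3 1\n"
    "1 4 1\n"
    "1 5 1\n"
    "2 1 1\n"
    "2 3 1\n"
    "2 6 1\n"
    "2 7 1\n"
)


def try_unpickle(path):
    """Return the message pickle raises for `path`, or None if it loads."""
    try:
        with open(path, "rb") as handle:
            pickle.load(handle)
    except pickle.UnpicklingError as err:
        return str(err)
    return None


def first_line(path):
    """Return the first line of a text file without its newline."""
    with open(path, "r") as handle:
        return handle.readline().rstrip("\n")


tmp_dir = tempfile.mkdtemp()
mm_path = os.path.join(tmp_dir, "foobar.mm")
with open(mm_path, "w") as handle:
    handle.write(MM_TEXT)

# The file is ordinary text, so it can be read line by line ...
header = first_line(mm_path)
print(header)

# ... but it is not a pickle, so the unpickler rejects its first byte.
error_message = try_unpickle(mm_path)
print(error_message)
```

The same mismatch is what happens in the question: the file on disk is text, while load() expects a pickle, so the fix is to choose the reader that matches the writer rather than to change the file.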