python latent-semantic-indexing gensim

LSI using gensim in python


I'm using Python's gensim library to do latent semantic indexing. I followed the tutorials on the website, and it works pretty well. Now I'm trying to modify it a bit: I want to be able to run the LSI model each time a document is added.

Here is my code:

from gensim import corpora, models

stoplist = set('for a of the and to in'.split())
num_factors = 3
corpus = []

# urls and getwords() are defined elsewhere in the script
for i, url in enumerate(urls):
    print "Importing", url
    doc = getwords(url)
    cleandoc = [word for word in doc.lower().split() if word not in stoplist]
    # create the dictionary from the first document, then grow it
    if i == 0:
        dictionary = corpora.Dictionary([cleandoc])
    else:
        dictionary.addDocuments([cleandoc])
    newVec = dictionary.doc2bow(cleandoc)
    corpus.append(newVec)
    # rebuild tf-idf and LSI over the whole corpus on every iteration
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    lsi = models.LsiModel(corpus_tfidf, numTopics=num_factors, id2word=dictionary)
    corpus_lsi = lsi[corpus_tfidf]

getwords is a function I wrote that returns the contents of a website as a string. Again, it works if I wait until I have processed all of the documents before doing tfidf and lsi, but that's not what I want. I want to do it on each iteration. Unfortunately, I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "streamlsa.py", line 51, in <module>
    lsi = models.LsiModel(corpus_tfidf, numTopics=num_factors, id2word=dictionary)
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/models/lsimodel.py", line 303, in __init__
    self.addDocuments(corpus)
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/models/lsimodel.py", line 365, in addDocuments
    self.printTopics(5) # TODO see if printDebug works and remove one of these..
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/models/lsimodel.py", line 441, in printTopics
    self.printTopic(i, topN = numWords)))
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/models/lsimodel.py", line 433, in printTopic
    return ' + '.join(['%.3f*"%s"' % (1.0 * c[val] / norm, self.id2word[val]) for val in most])
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/corpora/dictionary.py", line 52, in __getitem__
    return self.id2token[tokenid] # will throw for non-existent ids
KeyError: 1248

Usually the error pops up on the second document. I think I understand what it's telling me (the dictionary indices are bad), but I can't figure out WHY. I've tried lots of different things and nothing seems to work. Does anyone know what's going on?

Thanks!


Solution

  • This was a bug in gensim: the reverse id->word mapping was cached, but the cache was not updated after addDocuments().

    It was fixed in 2011 in this commit: https://github.com/piskvorky/gensim/commit/b88225cfda8570557d3c72b0820fefb48064a049
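
    To make the failure mode concrete, here is a minimal sketch of a stale reverse-mapping cache. TinyDictionary, its refresh_cache_on_add flag, and lookup_after_growth are hypothetical names made up for this illustration. This is not gensim's actual code, only a toy that behaves the way the answer describes.

```python
class TinyDictionary:
    """Toy token<->id mapping; a hypothetical stand-in, not gensim's Dictionary."""

    def __init__(self, refresh_cache_on_add):
        self.token2id = {}
        self._id2token = {}  # cached reverse mapping, built lazily
        self.refresh_cache_on_add = refresh_cache_on_add

    def add_documents(self, documents):
        # Assign the next free id to every token not seen before.
        for doc in documents:
            for token in doc:
                if token not in self.token2id:
                    self.token2id[token] = len(self.token2id)
        if self.refresh_cache_on_add:
            # Invalidate the cache so the next lookup rebuilds it.
            self._id2token = {}

    def __getitem__(self, token_id):
        # Build the reverse mapping only if the cache is empty.
        if not self._id2token:
            self._id2token = {i: t for t, i in self.token2id.items()}
        return self._id2token[token_id]  # raises KeyError for unknown ids


def lookup_after_growth(refresh_cache_on_add):
    d = TinyDictionary(refresh_cache_on_add)
    d.add_documents([["human", "interface"]])
    d[0]  # first lookup populates the cache
    d.add_documents([["graph", "trees"]])  # adds ids 2 and 3
    try:
        return d[2]
    except KeyError:
        return None


buggy_result = lookup_after_growth(refresh_cache_on_add=False)
fixed_result = lookup_after_growth(refresh_cache_on_add=True)
```

    With the cache left stale, the lookup for an id added by the second document raises KeyError, which is the same symptom as in the traceback above. Invalidating the cache whenever documents are added makes the lookup succeed.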