Tags: python-3.x, gensim, lda, topic-modeling, index-error

IndexError when trying to update gensim's LdaModel


I am facing the following error when trying to update my gensim's LdaModel:

IndexError: index 6614 is out of bounds for axis 1 with size 6614

I checked why other people were having this issue in this thread; their mistake was not using the same dictionary throughout, but I am using the same dictionary from beginning to end.

As I have a big dataset, I am loading it chunk by chunk (using pickle.load). I am building the dictionary iteratively with the following piece of code:

import pickle
from time import time

from gensim.corpora import Dictionary

# Build the dictionary incrementally, one pickled chunk at a time
fr_documents_lda = open("documents_lda_40_rails_30_ruby_full.dat", 'rb')
dictionary = Dictionary()
chunk_no = 0
while True:
    try:
        t0 = time()
        documents_lda = pickle.load(fr_documents_lda)
        chunk_no += 1
        dictionary.add_documents(documents_lda)
        t1 = time()
        print("Chunk number {0} took {1:.2f}s".format(chunk_no, t1 - t0))
    except EOFError:
        print("Finished going through pickle")
        break

Once the dictionary is built for the whole dataset, I am training the model in the same iterative fashion:

import pickle
from time import time

from gensim.models import LdaModel

fr_documents_lda = open("documents_lda_40_rails_30_ruby_full.dat", 'rb')
first_iter = True
chunk_no = 0
lda_gensim = None
while True:
    try:
        t0 = time()
        documents_lda = pickle.load(fr_documents_lda)
        chunk_no += 1
        # Convert the chunk to bag-of-words format using the full dictionary
        corpus = [dictionary.doc2bow(text) for text in documents_lda]
        if first_iter:
            first_iter = False
            # no_topics is defined earlier in my script
            lda_gensim = LdaModel(corpus, num_topics=no_topics, iterations=100,
                                  offset=50., random_state=0, alpha='auto')
        else:
            lda_gensim.update(corpus)
        t1 = time()
        print("Chunk number {0} took {1:.2f}s".format(chunk_no, t1 - t0))
    except EOFError:
        print("Finished going through pickle")
        break

I also tried updating the dictionary at every chunk, i.e. calling

dictionary.add_documents(documents_lda)

right before

corpus = [dictionary.doc2bow(text) for text in documents_lda]

in the last piece of code. Finally, I tried setting the allow_update argument of doc2bow to True. Nothing works.

FYI, the size of my final dictionary is 85k terms, while the dictionary built from only the first chunk has 10k terms. The error occurs on the second iteration, when execution enters the else branch and calls the update method.

The error is raised by the line expElogbetad = self.expElogbeta[:, ids], called by gamma, sstats = self.inference(chunk, collect_sstats=True), itself called by gammat = self.do_estep(chunk, other), itself called by lda_gensim.update(corpus).
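As a sanity check (a minimal sketch using the objects from the code above; gensim's LdaModel stores its fixed vocabulary size in num_terms), comparing the model's vocabulary size with the dictionary's size shows the mismatch:

# Sketch: the model's vocabulary size is frozen at creation time,
# inferred from the highest id seen in the first chunk's corpus
# (6614 here), while the full dictionary is much larger.
print(lda_gensim.num_terms)   # vocabulary size inferred from the first chunk
print(len(dictionary))        # full vocabulary, 85k terms

# Any doc2bow id >= lda_gensim.num_terms makes
# self.expElogbeta[:, ids] fail with the IndexError above.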

Does anyone have an idea how to fix this, or what is happening?

Thank you in advance.


Solution

  • The solution is simply to initialize the LdaModel with the argument id2word=dictionary.

    If you don't do that, it assumes that your vocabulary size is the vocabulary size of the first set of documents you train it on, and it can never grow. In fact, it sets its num_terms value to the length of id2word at initialization and never updates it afterwards (you can verify this in the update function); see the sketch below.
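
    Concretely, a minimal sketch of the fix, based on the training loop in the question (only the first-iteration call changes; no_topics is the variable from the question's code):

    from gensim.models import LdaModel

    # Pass the full dictionary so the model knows the complete vocabulary
    # up front, instead of inferring it from the first chunk's corpus.
    lda_gensim = LdaModel(
        corpus,
        id2word=dictionary,      # <-- the fix
        num_topics=no_topics,
        iterations=100,
        offset=50.,
        random_state=0,
        alpha='auto',
    )

    With id2word set, num_terms covers the whole 85k-term vocabulary, so the ids produced by doc2bow on later chunks stay within the bounds of expElogbeta.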