nlp stanford-nlp word2vec doc2vec

How to train word2vec with your own vocab


I am getting an error while training word2vec with my own vocabulary, and I don't understand why it is happening.

Code:

from gensim.models import word2vec
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

sentences = word2vec.LineSentence('test_data')

model = word2vec.Word2Vec(sentences, size=20)
model.build_vocab(sentences,update=True)
model.train(sentences)

print model.most_similar(['course'])

It throws an error:

2017-08-27 16:50:04,590 : INFO : precomputing L2-norms of word weight vectors
Traceback (most recent call last):
  File "tryword2vec.py", line 23, in <module>
    print model.most_similar(['course']) 
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 1285, in most_similar
    return self.wv.most_similar(positive, negative, topn, restrict_vocab, indexer)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 97, in most_similar
    raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word 'course' not in vocabulary"

test_data contains:

Bachelor of Engg is a course. M.Tech is a course. ME is a course. B.Tech is a course. Bachelor of Arts is a course. Fashion Design is a course. Multimedia is a course. Mechanical engg is a course. Computer Science is a course. Electronics is a cource. Engineering is a course. MBA is a course. BBA is a course.

Any help is appreciated.


Solution

  • You are getting the error because the word course is not in the vocabulary. The token that is actually present is course. (with a trailing period).

    LineSentence splits each line on whitespace only, so the period "." at the end of course stays attached to the word.

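    Check your vocabulary with model.wv.vocab. Only words that occur at least min_count times (5 by default) are kept, which is why so few tokens appear: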

    {u'a': <gensim.models.keyedvectors.Vocab at 0x7fe086c461d0>,
     u'course.': <gensim.models.keyedvectors.Vocab at 0x7fe0b4704f90>,
     u'is': <gensim.models.keyedvectors.Vocab at 0x7fe086ba0d10>}
    

    And do hide your API keys.
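To illustrate the tokenization issue described above, here is a minimal sketch in plain Python (no gensim needed) that compares whitespace splitting with a simple cleanup step. The `clean` helper and the sentence-splitting regex are assumptions made for illustration, not part of gensim, and the sample line is just the first three sentences of test_data.

```python
import re
from collections import Counter

# First three sentences of test_data, on one line as in the question.
line = "Bachelor of Engg is a course. M.Tech is a course. ME is a course."

# Whitespace splitting, which is how LineSentence tokenizes a line:
# the trailing period stays attached to "course".
raw_tokens = line.split()
print("course" in raw_tokens)   # False
print("course." in raw_tokens)  # True


def clean(token):
    # Hypothetical helper: lowercase the token and strip punctuation from
    # both ends, keeping inner dots so "M.Tech" remains a single token.
    return token.lower().strip(".,;:!?")


# Split the line into sentences at a period followed by whitespace,
# then clean every token and drop any that end up empty.
sentences = [
    [clean(t) for t in sentence.split() if clean(t)]
    for sentence in re.split(r"\.\s+", line)
    if sentence.strip()
]
print(sentences[1])  # ['m.tech', 'is', 'a', 'course']

# Word frequencies, to compare against the min_count threshold.
counts = Counter(token for sentence in sentences for token in sentence)
print(counts["course"])  # 3
```

A list of token lists like `sentences` can be passed to word2vec.Word2Vec in place of the LineSentence object. With a corpus this small you would probably also want to lower min_count (for example min_count=1), since in this three-sentence sample course occurs only 3 times and would be dropped under the default of 5. Note that the size parameter used in the question was renamed vector_size in gensim 4.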