
Training a custom word2vec model


I have my own dataset that I want to use to train a gensim Word2Vec model, but I'm not sure how to do it.

from google.colab import files
import io
from gensim.models import Word2Vec

uploaded = files.upload()
data_path = 'chatbot_dataset.txt'
with open(data_path, 'r') as f:
    lines = f.read().split('\n')

input_texts = []
target_texts = []
for line in lines:
    input_text = line.split('\t')[0]
    if len(input_text.split()) > MAX_SENTENCE_LENGTH:  # MAX_SENTENCE_LENGTH is defined earlier in the notebook
      break
    target_text = '<START> ' + line.split('\t')[1] + " <END>"
    input_texts.append(input_text)
    target_texts.append(target_text)

model = Word2Vec(lines, min_count=1, workers=3, size=100, window=3, sg=1)
model.wv.get_vector('hello')

But I got this error when running it, even though the word 'hello' is in my dataset:

KeyError                                  Traceback (most recent call last)
<ipython-input-15-b41c8cb17d3b> in <module>()
    140 model.wv.vector_size
    141 #check out how 'PEM' is represented in an array of 100 numbers
--> 142 model.wv.get_vector('hello')
    143 #find words with similar meaning to 'PEN'
    144 model.wv.most_similar('to')

1 frames
/usr/local/lib/python3.7/dist-packages/gensim/models/keyedvectors.py in word_vec(self, word, use_norm)
    450             return result
    451         else:
--> 452             raise KeyError("word '%s' not in vocabulary" % word)
    453 
    454     def get_vector(self, word):

KeyError: "word 'hello' not in vocabulary"

Solution

  • You're feeding lines, which appears to be a list of plain strings, to Word2Vec.

    Word2Vec is instead expecting a re-iterable sequence of items, where each item is a pre-tokenized list-of-strings. If you pass it plain strings, Word2Vec treats each string as if it were a list, and thus sees it as a list of single characters – so the entire set of 'words' it learns will just be single characters. (There may have been a warning in your logs about that, or, if you were running with at least INFO logging, progress reporting showing a suspiciously tiny number of discovered unique words.) A sketch of the corrected preprocessing is shown after this answer.

    You can look at what your model's vocabulary wound up being by examining model.wv.index_to_key - for example, peeking at the 10 most-common words found with print(model.wv.index_to_key[:10]). If that doesn't look right, make sure you're properly preprocessing/tokenizing the corpus you're handing to Word2Vec.

    Separately: min_count=1 is never a good idea with Word2Vec. Only words with multiple, varied usage examples can achieve useful word-vectors, and discarding the rarest words, as with the default min_count=5, usually ensures the best-quality vectors for all surviving words. (If there are words with fewer than 5 usage examples for which you need vectors, the best approach is to obtain more varied-usage training data.)
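
    Here is a minimal sketch of the kind of preprocessing Word2Vec expects, assuming your chatbot_dataset.txt has one line per example and using gensim's simple_preprocess as one possible tokenizer. The vector_size and index_to_key names are the gensim 4.x spellings; on the gensim 3.x shown in your traceback the equivalents are size and index2word.

from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

data_path = 'chatbot_dataset.txt'

# Each item handed to Word2Vec should be a list of word-strings, not a raw string.
sentences = []
with open(data_path, 'r') as f:
    for line in f:
        tokens = simple_preprocess(line)  # lowercase + split into word tokens
        if tokens:
            sentences.append(tokens)

model = Word2Vec(
    sentences,
    vector_size=100,   # 'size=100' on gensim 3.x
    window=3,
    sg=1,
    min_count=5,       # keep the default; min_count=1 retains junk one-off words
    workers=3,
)

# Sanity check: these should be real words, not single characters.
print(model.wv.index_to_key[:10])  # 'model.wv.index2word[:10]' on gensim 3.x

# Only look up words that survived the min_count cutoff.
if 'hello' in model.wv:
    print(model.wv.get_vector('hello')[:5])
    print(model.wv.most_similar('hello', topn=3))

    If that first print still shows single characters, the tokenization step isn't actually being applied to whatever sequence you're passing into Word2Vec.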