python, word2vec

Word2Vec empty word not in vocabulary


I'm currently required to work on a multilingual text classification model where I have to classify whether two sentences in two languages are semantically similar. I'm also required to use Word2Vec for word embedding.

I am able to generate the word embeddings using Word2Vec. However, when I try to convert my sentences to vectors with a method similar to this, I get an error saying:

KeyError: "word '' not in vocabulary"

Here is my code snippet

import re

import nltk
from gensim.models import Word2Vec

nltk.download('punkt')
# concatenated_text: list of strings, each holding two sentences joined by ';'
tokenized_text_data = [nltk.word_tokenize(sub) for sub in concatenated_text]

model = Word2Vec(sentences=tokenized_text_data, min_count=1)

# Error happens here
train_vectors = [model.wv[re.split(" |;", row)] for row in concatenated_text]

For context, concatenated_text holds the sentences from the two languages concatenated together with a semicolon as the delimiter, hence the call to re.split(" |;", row).

I guess the important thing now is to understand why the error is telling me that an empty string '' is not in the vocabulary.

I did not provide the sentences because the dataset is too big, and I can't find which word of which sentence is producing this error.


Solution

  • It turns out the cause was the delimiter I had added myself. The sentences in the dataset already contain semicolons, and because re.split(" |;") splits on every single space or semicolon, adjacent delimiters produce empty strings. A sentence such as ice cream ; bread ; milk is split into ['ice', 'cream', '', '', 'bread', '', '', 'milk']. Looking up those empty strings in model.wv is what raises the error word '' not in vocabulary.

    I hope this helps someone in the future!
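
For anyone hitting the same error, here is a minimal sketch of one way to avoid the empty tokens, assuming each row is a plain string as in the question. The helper names split_tokens and rows_with_empty_tokens are hypothetical and are not part of gensim or nltk. It reproduces the problem on the example sentence, splits on runs of delimiters instead, and shows a way to locate the rows that trigger the error in a large dataset.

```python
import re


def split_tokens(row):
    """Split on runs of spaces/semicolons and drop any empty strings."""
    return [tok for tok in re.split(r"[ ;]+", row) if tok]


def rows_with_empty_tokens(rows):
    """Return the indices of rows where the original pattern yields ''."""
    return [i for i, row in enumerate(rows) if "" in re.split(" |;", row)]


sample = "ice cream ; bread ; milk"

# The original pattern splits on every single space or semicolon,
# so adjacent delimiters leave empty strings between them.
naive = re.split(" |;", sample)

# Treating a run of delimiters as one separator avoids the empty tokens.
cleaned = split_tokens(sample)

print(naive)
print(cleaned)
```

The cleaned list can then be used in place of the raw re.split result, for example model.wv[split_tokens(row)]. This only addresses the empty-string case: words that nltk.word_tokenize split differently during training may still raise a KeyError.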