I'm currently required to work on a multilingual text classification model, where I have to classify whether two sentences in two languages are semantically similar. I'm also required to use Word2Vec for the word embeddings.
I am able to generate the word embeddings with Word2Vec. However, when I try to convert my sentences to vectors with a method similar to this, I get an error saying
KeyError: "word '' not in vocabulary"
Here is my code snippet:

import re
import nltk
from gensim.models import Word2Vec

nltk.download('punkt')

# Tokenize each concatenated sentence pair
tokenized_text_data = [nltk.word_tokenize(sub) for sub in concatenated_text]
model = Word2Vec(sentences=tokenized_text_data, min_count=1)

# Error happens here
train_vectors = [model.wv[re.split(" |;", row)] for row in concatenated_text]
For context, concatenated_text contains the sentences from the two languages concatenated together with a semicolon as the delimiter, hence the re.split(" |;") call.
I guess the important thing now is to understand why the error is telling me that an empty string '' is not in the vocabulary.
I did not include the sentences because the dataset is too big, and I can't seem to find which word of which sentence is producing this error.
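In case anyone wants to reproduce the situation: a minimal check like the one below can reveal which tokens are missing from the vocabulary. The data and vocabulary here are made-up stand-ins, not my real dataset (in gensim 4.x the real vocabulary would be model.wv.key_to_index).

```python
import re

# Hypothetical stand-ins for the real dataset and model vocabulary
concatenated_text = ["ice cream ; bread ; milk"]
vocab = {"ice", "cream", "bread", "milk"}  # stand-in for model.wv.key_to_index

# Collect every token the lookup would fail on
missing = [
    token
    for row in concatenated_text
    for token in re.split(" |;", row)
    if token not in vocab
]
print(missing)  # empty strings show up as '' in the list
```

Running this makes the culprit visible immediately: the missing "words" are all empty strings.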
It turns out it was the delimiter that I concatenated with myself all along. There are other semicolons in the sentence dataset, and with how re.split(" |;") works, every single space or semicolon is a split point, so wherever two delimiters sit next to each other (like a space followed by a semicolon) the split produces an empty string between them. A sentence such as ice cream ; bread ; milk therefore becomes ['ice', 'cream', '', '', 'bread', '', '', 'milk']. Hence the error word '' not in vocabulary.
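For completeness, here is a sketch of two ways to avoid the empty tokens (neither is from my original code, but either should work): split on runs of delimiters with a character class, or keep the original pattern and filter out the empty strings afterwards.

```python
import re

row = "ice cream ; bread ; milk"

# The original pattern splits on every single space or semicolon,
# so adjacent delimiters produce empty strings:
print(re.split(" |;", row))    # ['ice', 'cream', '', '', 'bread', '', '', 'milk']

# Option 1: treat any run of spaces/semicolons as a single delimiter
print(re.split("[ ;]+", row))  # ['ice', 'cream', 'bread', 'milk']

# Option 2: keep the original pattern and drop the empty strings
tokens = [t for t in re.split(" |;", row) if t]
print(tokens)                  # ['ice', 'cream', 'bread', 'milk']
```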
I hope this benefits someone in the future!