I'm currently doing sentiment analysis on the IMDB review dataset as part of a college homework assignment. I'm required to first do some preprocessing (e.g. tokenization, stop-word removal, stemming, lemmatization), then use different ways to convert this data to vectors to be classified by different classifiers. Gensim's FastText library was one of the required models for obtaining word embeddings on the data I got from the text-preprocessing step.
The problem I faced with Gensim is that I first tried to train on my own data using vectors of feature size 100, 200, and 300, but they always fail at some point. I later tried several of Gensim's pre-trained vector sets, but none of them contained embeddings for every word; instead they fail at some point with this error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-28-644253c70aa3> in <module>()
----> 1 model.most_similar(some-rare-word)
1 frames
/usr/local/lib/python3.7/dist-packages/gensim/models/keyedvectors.py in word_vec(self, word, use_norm)
450 return result
451 else:
--> 452 raise KeyError("word '%s' not in vocabulary" % word)
453
454 def get_vector(self, word):
KeyError: "word some-rare-word not in vocabulary"
The ones I've tried so far are:
conceptnet-numberbatch-17-06-300 : doesn't contain "glass"
word2vec-google-news-300 : insufficient RAM in Google Colab
glove-twitter-200 : doesn't contain "5"
crawl-300d-2M : doesn't contain "waltons"
wiki-news-300d-1M : doesn't contain "waltons"
glove-wiki-gigaword-300 : doesn't contain "riget"
I got their names from these sources, here and here.
By inspecting the failing words, I found that even the largest vector sets usually fail because of misspelled or obscure words that have no common meaning, like 'riget', 'waltons', etc.
Is there a way to detect and discard these strange words before feeding them to Gensim and getting this error? Or am I using Gensim completely wrong, and is there another way to use it?
Any snippet of code, or some sort of lead on what to do, would be appreciated.
My code so far:
import gensim.downloader as api
model = api.load("glove-wiki-gigaword-300")  # this can be any of the vector sets mentioned above
train_word_embeddings = []
# train_lemm is a list of shape (number of examples, number of words remaining in
# example sentence i after stop-word removal and lemmatization to nouns);
# there are 25000 example review sentences, and the second dimension varies
# with the number of words
for i in range(len(train_lemm)):
    train_word_embeddings.append(model.wv[train_lemm[i]])
If you train your own word-vector model, then it will contain vectors for all the words you told it to learn. If a word that was in your training data doesn't appear to have a vector, it likely did not appear the required min_count number of times. (These models tend to improve if you discard rare words whose few example usages may not be suitably informative, so the default min_count=5 is a good idea.)
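As a minimal sketch of training such a model yourself (assuming the Gensim 4.x parameter names vector_size and epochs, and reusing the train_lemm token lists from the question):

from gensim.models import Word2Vec

# train_lemm: the question's list of token lists, one list per review
w2v = Word2Vec(
    sentences=train_lemm,
    vector_size=300,  # dimensionality of each word vector
    min_count=5,      # discard words seen fewer than 5 times (the default)
    epochs=10,
)

# only words that met min_count made it into the vocabulary
print("glass" in w2v.wv.key_to_index)  # True only if 'glass' appeared >= 5 times

Words below the threshold simply never get a vector, which is why even your own trained model can raise KeyError on a rare word.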
It's often reasonable for downstream tasks, like feature engineering using the text & set of word-vectors, to simply ignore words with no vector. That is, if some_rare_word in model.wv is False, just don't try to use that word – & its missing vector – for anything. So you don't necessarily need to find, or train, a set of word-vectors with every word you need. Just elide, rather than worry about, the rare missing words.
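Applied to your loop, a minimal sketch of that skip-the-missing-words approach (assuming a Gensim 4.x KeyedVectors object, where a plain membership test like word in model checks the vocabulary; the zero-vector fallback is just one hypothetical choice for all-unknown sentences):

import numpy as np
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-300")  # returns a KeyedVectors object

train_word_embeddings = []
for sentence in train_lemm:
    # keep only the tokens the model actually has a vector for
    known = [word for word in sentence if word in model]
    if known:
        train_word_embeddings.append(model[known])  # shape: (len(known), 300)
    else:
        # fallback (an assumption, not required): represent a sentence
        # whose words are all out-of-vocabulary as a single zero vector
        train_word_embeddings.append(np.zeros((1, model.vector_size)))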
Separate observations:
FastText also offers a supervised classification mode, but it's not supported by Gensim. If you want to experiment with that, you'd need to use the Facebook FastText implementation. (You could still use a traditional, non-supervised FastText word-vector model as a contributor of features for other possible representations.)
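For that non-supervised route, here's a sketch of training Gensim's own FastText on the preprocessed data (again assuming Gensim 4.x parameter names). One relevant bonus: a trained FastText model composes vectors for out-of-vocabulary words from their character n-grams, so lookups like the one in your traceback won't raise KeyError:

from gensim.models import FastText

ft = FastText(
    sentences=train_lemm,  # the question's preprocessed token lists
    vector_size=100,
    min_count=5,
    epochs=10,
)

# FastText synthesizes a vector for any word from its character n-grams,
# even one that was rare or never seen in training:
vec = ft.wv["riget"]
print(vec.shape)  # (100,)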