Tags: python, gensim, sentiment-analysis, word-embedding

Rare misspelled words mess up my fastText/word-embedding classifiers


I'm currently doing sentiment analysis on the IMDB review dataset as part of a homework assignment for my college. I'm required to first do some preprocessing, e.g. tokenization, stop-word removal, stemming, and lemmatization, then use different ways to convert this data to vectors to be classified by different classifiers. Gensim's FastText was one of the required models for obtaining word embeddings from the data produced by the text pre-processing step.

The problem I faced with Gensim is this: I first tried to train on my own data using vectors of feature size 100, 200, and 300, but they always failed at some point. I later tried many of Gensim's pre-trained vector sets, but none of them contained embeddings for all of my words; each of them eventually failed with this error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-28-644253c70aa3> in <module>()
----> 1 model.most_similar(some-rare-word)

1 frames
/usr/local/lib/python3.7/dist-packages/gensim/models/keyedvectors.py in word_vec(self, word, use_norm)
    450             return result
    451         else:
--> 452             raise KeyError("word '%s' not in vocabulary" % word)
    453 
    454     def get_vector(self, word):

KeyError: "word some-rare-word not in vocabulary"

The ones I've tried so far:

  • conceptnet-numberbatch-17-06-300: doesn't contain "glass"
  • word2vec-google-news-300: insufficient RAM in Google Colab
  • glove-twitter-200: doesn't contain "5"
  • crawl-300d-2M: doesn't contain "waltons"
  • wiki-news-300d-1M: doesn't contain "waltons"
  • glove-wiki-gigaword-300: doesn't contain "riget"

I got their names from these sources: here and here

By inspecting the failing words, I found that even the largest vector sets usually fail on misspelled words that have no meaning, like 'riget', 'waltons', etc.

Is there a way to detect and discard these strange words before feeding them to Gensim, so I don't get this error? Or am I using Gensim the wrong way, and is there another way to use it?

Any snippet of code or any sort of lead on what to do would be appreciated.

My code so far:

import gensim.downloader as api

model = api.load("glove-wiki-gigaword-300")  # this can be any of the previously mentioned vector sets
train_word_embeddings = []
# train_lemm is a list of size (number of examples, number of words remaining in
# example sentence i after stop-word removal and lemmatization to nouns);
# there are roughly 25000 example reviews, and the second dimension varies with
# the number of words in each review
for i in range(len(train_lemm)):
    train_word_embeddings.append(model.wv[train_lemm[i]])

Solution

  • If you train your own word-vector model, then it will contain vectors for all the words you told it to learn. If a word that was in your training data doesn't appear to have a vector, it likely did not appear the required min_count number of times. (These models tend to improve if you discard rare words, whose few example usages may not be suitably informative, so the default min_count=5 is a good idea; see the Word2Vec sketch after this list.)

    It's often reasonable for downstream tasks, like feature engineering using the text & set of word-vectors, to simply ignore words with no vector. That is, if some_rare_word in model.wv is False, just don't try to use that word – & its missing vector – for anything. So you don't necessarily need to find, or train, a set of word-vectors with every word you need. Just elide, rather than worry about, the rare missing words (the averaging sketch after this list shows one way to do that).

    Separate observations:

    • Stemming/lemmatization & stop-word removal aren't always worth the trouble, with all corpora/algorithms/goals. (And, stemming/lemmatization may wind up creating pseudowords that limit the model's interpretability & easy application to any texts that don't go through identical preprocessing.) So if those are required parts of the learning exercise, sure, get some experience using them. But don't assume they're necessarily helping, or worth the extra time/complexity, unless you verify that rigorously.
    • FastText models will also be able to supply synthetic vectors for words that aren't known to the model, based on substrings. These are often pretty weak, but may be better than nothing – especially when they give vectors for typos, or rare inflected forms, similar to morphologically-related known words. (Since this deduced similarity, from many similarly-written tokens, provides some of the same value as stemming/lemmatization via a different path, one that required the original variations to all be present during initial training, you'd especially want to pay attention to whether FastText & stemming/lemmatization mix well for your goals; see the FastText sketch after this list.) Beware, though: for very-short unknown words – for which the model learned no reusable substring vectors – FastText may still return an error or an all-zeros vector.
    • FastText has a supervised classification mode, but it's not supported by Gensim. If you want to experiment with that, you'd need to use the Facebook FastText implementation. (You could still use a traditional, non-supervised FastText word vector model as a contributor of features for other possible representations.)
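
To make the min_count point concrete, here's a minimal sketch of training your own vectors with Gensim 4.x, assuming train_lemm is the list of tokenized reviews from the question; any word rarer than min_count simply never gets a vector:

from gensim.models import Word2Vec

# Words appearing fewer than min_count times are discarded before training,
# so they will never have a vector in w2v.wv.
w2v = Word2Vec(sentences=train_lemm, vector_size=100, min_count=5, epochs=10)

print('movie' in w2v.wv)   # True for any sufficiently-frequent word
print('riget' in w2v.wv)   # likely False, if 'riget' is rare in the corpus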
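
For the elide-the-missing-words approach, one common recipe is to average only the vectors of known words into a fixed-size per-review feature. A minimal sketch, assuming the same train_lemm and a pre-trained model loaded as in the question (the helper name review_vector is illustrative, not a Gensim API):

import numpy as np
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-300")

def review_vector(tokens, kv):
    """Average the vectors of the words kv knows; silently skip the rest."""
    known = [w for w in tokens if w in kv]  # drop out-of-vocabulary words
    if not known:
        # the review had no known words at all; fall back to a zero vector
        return np.zeros(kv.vector_size, dtype=np.float32)
    return np.mean(kv[known], axis=0)

train_word_embeddings = [review_vector(tokens, model) for tokens in train_lemm]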
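
And a quick sketch of FastText's out-of-vocabulary behavior, again assuming Gensim 4.x and the train_lemm token lists (the probed words are just examples):

from gensim.models import FastText

# FastText also learns vectors for character n-grams, so it can synthesize a
# vector for a word it never saw as a whole token.
ft = FastText(sentences=train_lemm, vector_size=100, min_count=5, epochs=10)

print('waltons' in ft.wv.key_to_index)  # False: never learned as a full word
print(ft.wv['waltons'][:5])             # but a substring-built vector anyway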