Tags: python, nlp, gensim, word2vec, tfidfvectorizer

Inconsistencies between bigrams found by TfidfVectorizer and Word2Vec model


I am building a topic model from scratch, one step of which uses scikit-learn's TfidfVectorizer to get unigrams and bigrams from my corpus of texts:

    from sklearn.feature_extraction.text import TfidfVectorizer
    tfidf_vectorizer = TfidfVectorizer(min_df=0.1, max_df=0.9, ngram_range=(1, 2))
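The bigrams only materialize after fitting, as space-separated feature names. A minimal sketch, where raw_texts is just an illustrative stand-in for my corpus of texts:

    # Illustrative stand-in documents; TfidfVectorizer expects raw strings.
    raw_texts = ["the skittish cat jumped over the gap", "the cat sat still"]
    tfidf_matrix = tfidf_vectorizer.fit_transform(raw_texts)
    # Bigram features come back space-separated, e.g. 'skittish cat'
    # (get_feature_names_out() needs scikit-learn 1.0+; older versions use get_feature_names()).
    bigrams = [t for t in tfidf_vectorizer.get_feature_names_out() if ' ' in t]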

After topics are created, I use the similarity scores provided by gensim's Word2Vec to determine coherence of topics. I do this by training on the same corpus:

    from gensim.models import Phrases, Word2Vec

    bigram_transformer = Phrases(corpus)
    model = Word2Vec(bigram_transformer[corpus], min_count=1)

For many of the bigrams in my topics, however, I get a KeyError because that bigram was not picked up during Word2Vec training, despite both steps running on the same corpus. I think this is because the Phrases transformer decides which bigrams to form based on statistical analysis (Why aren't all bigrams created in gensim's `Phrases` tool?)
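To illustrate, the failing lookups look roughly like this (the bigram key here is just an example; checking membership with `in` avoids the exception):

    # Phrases joins promoted pairs with an underscore, e.g. 'skittish_cat',
    # while TfidfVectorizer's bigrams are space-separated, e.g. 'skittish cat'.
    candidate = 'skittish_cat'  # example key only
    if candidate in model.wv:
        print(model.wv.most_similar(candidate))
    else:
        print(f"{candidate!r} has no vector - it was never formed by Phrases")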

Is there a way to get the Word2Vec model to include all the bigrams identified by TfidfVectorizer? I see trimming capabilities such as 'trim_rule', but nothing in the other direction.


Solution

  • The point of the Phrases model in Gensim is to pick some bigrams, which are calculated to be statistically-significant.

    If you then apply that model's determinations as a preprocessing step on your corpus, certain pairs of unigrams will be outright replaced in your text with the combined bigram. (As such, it's possible some unigrams that were there originally will no longer appear even once.)

    Thus the concepts of bigrams as used by Gensim's Phrases and the TfidfVectorizer's ngram_range facility are different. Phrases is meant for destructive replacements where specific bigrams are inferred to be more interesting than the unigrams. TfidfVectorizer will add extra bigrams as additional dimensional features.

    I suppose the right tuning of Phrases could cause it to consider every bigram as significant. Without checking, it looks like a super-tiny threshold value, like 0.0000000001, might have essentially that effect. (The Phrases class will reject a threshold of 0 as nonsensical given its usual use.)
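    If you wanted to experiment in that direction, it might look roughly like the following (untested; threshold and min_count are real Phrases parameters, but the exact tiny value needed is a guess):

        from gensim.models import Phrases, Word2Vec

        # Speculative: with min_count=1 and a near-zero threshold, the default
        # scorer should accept nearly every observed pair as a "phrase".
        bigram_transformer = Phrases(corpus, min_count=1, threshold=0.0000000001)
        model = Word2Vec(bigram_transformer[corpus], min_count=1)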

    But at that point, your later transformation (via bigram_transformer[corpus]) will combine every possible pair of words before Word2Vec training. For example, the sentence:

    ['the', 'skittish', 'cat', 'jumped', 'over', 'the', 'gap',]
    

    ...would indiscriminately become...

    ['the_skittish', 'cat_jumped', 'over_the', 'gap',]
    

    It seems unlikely that you want that, for a number of reasons:

    • There might then be no training texts with the 'cat' unigram alone, leaving you with no word-vector for that word at all.
    • Bigrams that are rare or of little grammatical value (like 'the_skittish') will receive trained word-vectors, & take up space in the model.
    • The kinds of text corpora that are large enough for good Word2Vec results might have far more bigrams than are manageable. (A corpus small enough that you can afford to track every bigram may be on the thin side for good Word2Vec results.)

    Further, to perform that greedy-combination of all bigrams, the Phrases frequency-survey & calculations aren't even necessary. (It can be done automatically with no preparation/analysis.)
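    For illustration only, a trivial pass like this (a sketch, not anything Phrases provides; greedy_pairs is a made-up name) would produce the same indiscriminate pairing:

        # Join consecutive tokens two-at-a-time, with no statistical survey at all;
        # a trailing token in an odd-length sentence is kept on its own.
        def greedy_pairs(tokens):
            return ['_'.join(tokens[i:i + 2]) for i in range(0, len(tokens), 2)]

        greedy_pairs(['the', 'skittish', 'cat', 'jumped', 'over', 'the', 'gap'])
        # -> ['the_skittish', 'cat_jumped', 'over_the', 'gap']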

    So, you shouldn't expect every bigram from TfidfVectorizer to get a word-vector, unless you take some extra steps, outside the normal behavior of Phrases, to ensure every such bigram was in the training texts.

    To try to do so wouldn't necessarily need Phrases at all, might be unmanageable, and would involve other tradeoffs. (For example, I could imagine repeating the corpus many times, only combining a fraction of the bigrams each time – so that each is sometimes surrounded by other unigrams, and sometimes by other bigrams – to create a synthetic corpus with enough meaningful texts to create all your desired vectors, roughly as sketched below. But the logic & storage space for that model would be larger & more complicated, and without prominent precedent, so it'd be a novel experiment.)
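    Purely as a sketch of that speculative idea (nothing here is a Gensim feature; synthetic_copies, n_copies and combine_prob are made-up names, and this gives no hard guarantee that every bigram appears without more care):

        import random

        # Speculative sketch: emit several copies of each tokenized text, joining a
        # random subset of adjacent pairs in each copy, so that every bigram tends to
        # show up somewhere while the unigrams also keep appearing on their own.
        def synthetic_copies(tokens, n_copies=5, combine_prob=0.3):
            copies = []
            for _ in range(n_copies):
                out, i = [], 0
                while i < len(tokens):
                    if i + 1 < len(tokens) and random.random() < combine_prob:
                        out.append(tokens[i] + '_' + tokens[i + 1])
                        i += 2
                    else:
                        out.append(tokens[i])
                        i += 1
                copies.append(out)
            return copies

        # Each copy mixes joined bigrams with plain unigrams, e.g.
        # ['the_skittish', 'cat', 'jumped', 'over_the', 'gap']
        synthetic_copies(['the', 'skittish', 'cat', 'jumped', 'over', 'the', 'gap'])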