After training a bigram model and a trigram model with Gensim, I can export the bigrams from either model. I find that the two resulting lists can be quite different: there is a large overlap, but many bigrams appear in only one of the lists. Which is the right way to get the bigrams? Thanks!
import gensim

# texts_unigram: a list of tokenized sentences (each a list of str)
bigram_model = gensim.models.Phrases(texts_unigram)
texts_bigram = [bigram_model[sent] for sent in texts_unigram]
trigram_model = gensim.models.Phrases(texts_bigram)
# Get from the bigram model
bigrams1 = list(bigram_model.export_phrases().keys())
# Get from the trigram model
ngrams = list(trigram_model.export_phrases().keys())  # This includes bigrams and trigrams (and possibly 4-grams)
bigrams2 = [g for g in ngrams if g.count("_") == 1]
When you apply the Phrases class's statistical bigram combination multiple times, you're in experimental territory that doesn't have well-established rules of thumb.
So you should be guided by your own project's evaluations of model effectiveness: for whatever your downstream purposes are, which set of n-grams works better?
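Before running a downstream evaluation, it can help to see exactly where the two lists disagree. The following is a minimal sketch using plain Python sets; the helper name compare_phrase_lists and the toy phrase lists are hypothetical stand-ins for bigrams1 and bigrams2 from the question, not output from a real model.

```python
def compare_phrase_lists(first, second):
    """Split two phrase lists into shared and list-specific phrases."""
    first_set, second_set = set(first), set(second)
    return {
        "shared": sorted(first_set & second_set),
        "only_first": sorted(first_set - second_set),
        "only_second": sorted(second_set - first_set),
    }


# Made-up values standing in for bigrams1 and bigrams2 (assumption, not real model output).
bigrams_from_bigram_model = ["new_york", "machine_learning", "ice_cream"]
bigrams_from_trigram_model = ["new_york", "ice_cream", "data_set"]

report = compare_phrase_lists(bigrams_from_bigram_model, bigrams_from_trigram_model)
for name, phrases in report.items():
    # Print each group with its size so the disagreement is easy to inspect.
    print(f"{name}: {len(phrases)} {phrases}")
```

Inspecting the phrases that appear in only one list may suggest which set is more likely to suit your task, but the downstream evaluation should still decide.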
Note also:
The Phrases class will often combine things that don't match human intuitions and miss other things you might see as useful multiword n-grams, and tuning will often improve some pairings only at the expense of others. Ultimately, the n-grams created this way may not be appropriate or attractive for end-user display, but they might still help as input for classification or information-retrieval tasks.