I have created a bigram model using gensim and then tried to get the bigram sentences, but it's not picking up all the bigrams. Why?
from gensim.models.phrases import Phrases, Phraser
phrases = Phrases(sentences, min_count=1, threshold=1)
bigram_model = Phraser(phrases)
sent = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']
print(bigram_model[sent])
[u'the', u'mayor', u'of', u'new_york', u'was', u'there']
Can anyone explain how to get all bigrams?
The Phrases algorithm decides which word-pairs to promote to bigrams by a statistical analysis, which compares the base frequencies of each word individually with their frequency together.
So, some word-pairs will pass this test, and be combined, and others won't. If you're not getting the pairings you expect, then you can tune the algorithm somewhat using the Phrases class options, including threshold, min_count, and at least one alternate scoring-mechanism.
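To make the frequency comparison concrete, here is a minimal pure-Python sketch of the kind of score involved. It assumes the commonly described formula (count(a,b) - min_count) * vocab_size / (count(a) * count(b)); the function name score_bigrams and the toy corpus are made up for illustration, and vocab_size here counts only distinct single words, so the numbers will not necessarily match what gensim computes internally.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1):
    """Score each adjacent word pair (hypothetical helper, not gensim's API).

    score = (count(a, b) - min_count) * vocab_size / (count(a) * count(b))
    """
    unigrams = Counter()
    bigrams = Counter()
    for sent in sentences:
        unigrams.update(sent)                 # individual word frequencies
        bigrams.update(zip(sent, sent[1:]))   # adjacent-pair frequencies
    vocab_size = len(unigrams)                # assumption: distinct words only
    return {
        (a, b): (n - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
    }

# Toy corpus, far too small for real use; it only shows the arithmetic.
toy = [
    ["the", "mayor", "of", "new", "york", "was", "there"],
    ["new", "york", "is", "big"],
    ["the", "mayor", "was", "there"],
]
scores = score_bigrams(toy, min_count=1)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

Only pairs whose score exceeds the threshold get joined, which is why a pair seen just once, like "mayor of" here, scores 0 and is left alone while "new york" is combined.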
But, even maximally tuned, it won't typically create all the phrases that we, as natural-language speakers, would perceive, because it knows nothing of grammar or of the actual, logically related entities of the world. It only knows frequency statistics in the training text.
So there will be pairings it misses we'd see as natural & desirable, and pairings it creates we'd see as illogical. Still, even with these unaesthetic pairings – creating text that doesn't look right to people – the transformed text can often work better in certain downstream classification or information-retrieval tasks.
If you really just wanted all possible bigrams, that'd be a much simpler text transformation, not requiring the multiple passes & internal statistics-collection of gensim's Phrases.
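For instance, pairing every token with its successor needs no model at all. A minimal sketch, where all_bigrams is a made-up helper name and the underscore joiner simply mimics the style of the output shown in the question:

```python
def all_bigrams(tokens, joiner="_"):
    """Return every adjacent pair of tokens, joined into a single string."""
    return [joiner.join(pair) for pair in zip(tokens, tokens[1:])]

sent = ["the", "mayor", "of", "new", "york", "was", "there"]
print(all_bigrams(sent))
# ['the_mayor', 'mayor_of', 'of_new', 'new_york', 'york_was', 'was_there']
```

Unlike Phrases, this keeps every pairing regardless of how often it occurs, so the result contains plenty of pairs like "of_new" that aren't meaningful phrases.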
But also, if you do want to use gensim's Phrases technique, it will only perform well when it has a lot of training data. Toy-sized texts of just a few dozen words, or even many tens-of-thousands of words, won't give good results. You'd want millions to tens-of-millions of training words to have some chance of it really detecting statistically-valid word-pairings.