I noticed that the number of bigrams is higher than the number of unigrams, and that there are more trigrams than bigrams. So basically, there are more n-grams than unigrams. I don't understand how this is possible.
New Delhi is the capital of India.
No of unigrams - 7
No of bigrams - 6
No of trigrams - 5
Here I can clearly see that the number of unigrams will always be greater than the number of n-grams.
The number of distinct bigrams is greater than the number of distinct unigrams. Similarly, the number of distinct trigrams is greater than the number of distinct bigrams.
This holds when you count over an actual dataset that contains many strings. If you count over a single string, the result is the opposite: a string of N words contains N - n + 1 n-grams, so the count shrinks as n grows.
Let us understand it with the help of an example. Let us say string 1 contains: w1,w2,w3,w4,w5,w6
and string 2 contains: w1,w7,w3,w2,w5,w4,w6.
The set of distinct unigrams here is {w1,w2,w3,w4,w5,w6,w7}, so the total number of distinct unigrams is 7.
Now let us look at the bigrams. The distinct bigrams are:
{(w1,w2),(w2,w3),(w3,w4),(w4,w5),(w5,w6),(w1,w7),(w7,w3),(w3,w2),(w2,w5),(w5,w4),(w4,w6)}
So the total number of distinct bigrams here is 11.
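The counts for these two strings can be checked with a short script. This is only a minimal sketch: the helper names `ngrams` and `distinct_ngrams` are made up for illustration, and each string is assumed to be already split into words.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_ngrams(strings, n):
    """Collect the distinct n-grams found across all strings."""
    seen = set()
    for tokens in strings:
        seen.update(ngrams(tokens, n))
    return seen


# The two strings from the example, already split into words.
string1 = ["w1", "w2", "w3", "w4", "w5", "w6"]
string2 = ["w1", "w7", "w3", "w2", "w5", "w4", "w6"]
corpus = [string1, string2]

unigram_count = len(distinct_ngrams(corpus, 1))
bigram_count = len(distinct_ngrams(corpus, 2))
print(unigram_count, bigram_count)  # 7 11
```

Using a set is what makes repeated n-grams count only once, which is the whole difference between the two ways of counting.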
This happens because individual words repeat often, so many unigram occurrences collapse into the same distinct unigram. Pairs of words repeat less often, and triples of words repeat less often still. As you increase n, fewer n-grams repeat, so in a large dataset the number of distinct n-grams typically grows with n, even though the total number of n-gram occurrences in each string goes down.
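To see the effect of repetition, it can help to print the total number of n-gram occurrences next to the number of distinct n-grams. The sketch below uses a made-up toy corpus (not taken from any real dataset) and a hypothetical helper `count_ngrams`; words are split on whitespace, which is a simplifying assumption.

```python
from collections import Counter


def count_ngrams(sentences, n):
    """Count how many times each n-gram occurs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()  # naive whitespace tokenization
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical toy corpus in which common words repeat a lot.
toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat saw the dog",
]

summary = {}
for n in (1, 2, 3):
    counts = count_ngrams(toy_corpus, n)
    total = sum(counts.values())  # every occurrence, repeats included
    distinct = len(counts)        # each different n-gram counted once
    summary[n] = (total, distinct)
    print(n, total, distinct)
```

On this toy corpus the totals fall as n grows (17, 14, 11), while the distinct count rises from 8 unigrams to 10 bigrams and then stays at 10 for trigrams. A corpus this small levels off quickly; with many more strings the distinct count would usually keep growing for small n.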