python-3.x, nltk, cookbook

How do I use "BigramCollocationFinder" to find "Bigrams"?


I'm studying compiler construction using Python. I'm trying to create a list of all lowercased words in a text and then build a BigramCollocationFinder from it, which can be used to find bigrams (pairs of words).

These bigrams are found using the association measure functions in the nltk.metrics package.

I'm practising from the "Python 3 Text Processing with NLTK 3 Cookbook" and I found this example code:

from nltk.corpus import webtext
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
words = [w.lower() for w in webtext.words('grail.txt')]
bcf = BigramCollocationFinder.from_words(words)
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)

I'm stuck at:

bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)
In particular, I don't understand the arguments likelihood_ratio and 4.

Does likelihood_ratio mean a similarity ratio here, or what does it mean in this code?

Any guidance in this matter would be highly appreciated.


Solution

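  • I believe "NLTK collocations for specific words" should answer your question. nbest scores every bigram with the association measure you pass in, here BigramAssocMeasures.likelihood_ratio (a log-likelihood ratio, not PMI, which is the separate BigramAssocMeasures.pmi), and returns the 4 highest-scoring bigrams. A high score means the pair's co-occurrence deviates strongly from what you would expect if the two words were independent; it is a measure of association, not a similarity ratio.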
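
To make the two arguments concrete, here is a minimal standard-library sketch of what "score every bigram, then keep the best n" amounts to. It is not NLTK's code: the function names likelihood_ratio and nbest_bigrams and the sample sentence are made up for illustration, and the marginal counts are taken from the bigram list itself, so the scores will not necessarily match what NLTK reports.

```python
import math
from collections import Counter


def likelihood_ratio(n_ii, n_ix, n_xi, n_xx):
    """Log-likelihood ratio (G^2) for one bigram (w1, w2).

    n_ii: count of the bigram (w1, w2)
    n_ix: count of bigrams whose first word is w1
    n_xi: count of bigrams whose second word is w2
    n_xx: total number of bigrams
    """
    n_io = n_ix - n_ii                  # w1 followed by some other word
    n_oi = n_xi - n_ii                  # w2 preceded by some other word
    n_oo = n_xx - n_ii - n_io - n_oi    # neither w1 first nor w2 second
    observed = (n_ii, n_oi, n_io, n_oo)
    # Expected cell counts if w1 and w2 were independent.
    expected = (
        n_ix * n_xi / n_xx,
        (n_xx - n_ix) * n_xi / n_xx,
        n_ix * (n_xx - n_xi) / n_xx,
        (n_xx - n_ix) * (n_xx - n_xi) / n_xx,
    )
    # Empty cells contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def nbest_bigrams(words, n):
    """Return the n bigrams with the highest likelihood-ratio score."""
    bigrams = list(zip(words, words[1:]))
    pair_counts = Counter(bigrams)
    first_counts = Counter(w1 for w1, _ in bigrams)
    second_counts = Counter(w2 for _, w2 in bigrams)
    total = len(bigrams)
    scores = {
        pair: likelihood_ratio(count, first_counts[pair[0]], second_counts[pair[1]], total)
        for pair, count in pair_counts.items()
    }
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda pair: (-scores[pair], pair))[:n]


# Hypothetical sample text, lowercased as in the question.
sample_words = "the holy grail and the holy hand grenade of the holy grail".split()
top = nbest_bigrams(sample_words, 4)
print(top)
```

In the question's code, BigramAssocMeasures.likelihood_ratio plays the role of the scoring function and 4 plays the role of n. Swapping in a different measure changes the ranking but not the "score, sort, take the top n" shape.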