NLTK collocations for specific words

I know how to get bigram and trigram collocations using NLTK, and I apply them to my own corpora. The code is below.

However, I'm not sure about two things: (1) how do I get the collocations for a particular word, and (2) does NLTK have a collocation metric based on the log-likelihood ratio?

import nltk
from nltk.collocations import *
from nltk.tokenize import word_tokenize

text = "this is a foo bar bar black sheep  foo bar bar black sheep foo bar bar black  sheep shep bar bar black sentence"

trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(word_tokenize(text))

# score_ngrams returns (ngram, score) tuples, highest score first
for i in finder.score_ngrams(trigram_measures.pmi):
    print(i)


  • Try this code:

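    import nltk
    from nltk.collocations import *
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    trigram_measures = nltk.collocations.TrigramAssocMeasures()
    # Assumption: any token sequence works here; the Genesis sample corpus
    # (nltk.download('genesis')) is a stand-in for your own corpus
    words = nltk.corpus.genesis.words('english-web.txt')
    # Ngrams with 'creature' as a member
    # (the filter returns True for n-grams to drop, i.e. those without 'creature')
    creature_filter = lambda *w: 'creature' not in w
    ## Bigrams
    finder = BigramCollocationFinder.from_words(words)
    # only bigrams that appear 3+ times
    finder.apply_freq_filter(3)
    # only bigrams that contain 'creature'
    finder.apply_ngram_filter(creature_filter)
    # return the 10 n-grams with the highest likelihood-ratio score
    print(finder.nbest(bigram_measures.likelihood_ratio, 10))
    ## Trigrams
    finder = TrigramCollocationFinder.from_words(words)
    # only trigrams that appear 3+ times
    finder.apply_freq_filter(3)
    # only trigrams that contain 'creature'
    finder.apply_ngram_filter(creature_filter)
    # return the 10 n-grams with the highest likelihood-ratio score
    print(finder.nbest(trigram_measures.likelihood_ratio, 10))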

    It ranks n-grams by the likelihood-ratio measure and filters out those that don't contain the word 'creature'.
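
The same word filter can be tried on the sample sentence from the question. This is only a minimal sketch: the target word `sheep` and the variable names are picked for illustration, and the text is split on whitespace instead of using `word_tokenize` so that no tokenizer data has to be downloaded.

```python
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

# Sample sentence from the question, split on whitespace
text = ("this is a foo bar bar black sheep  foo bar bar black sheep "
        "foo bar bar black  sheep shep bar bar black sentence")
tokens = text.split()

target = "sheep"  # hypothetical target word, chosen for illustration

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

# apply_ngram_filter drops every n-gram for which the function returns True,
# so return True whenever the target word is absent
finder.apply_ngram_filter(lambda *w: target not in w)

# Remaining bigrams, scored by log-likelihood ratio, highest score first
scored = finder.score_ngrams(bigram_measures.likelihood_ratio)
for ngram, score in scored:
    print(ngram, score)
```

Only bigrams that include the target word are left, so the output is effectively the collocation list for that one word. On a corpus this small the scores are not meaningful; a frequency filter such as `apply_freq_filter(3)` would normally be applied as well on real data.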