
How to find common phrases from a text document


I have a text file with a lot of comments/sentences, and I want to find the most common phrases repeated in the document. I experimented a bit with NLTK and found this thread: "How to extract common / significant phrases from a series of text entries".

However, after trying it, I get odd results like these:

>>> finder.apply_freq_filter(3)
>>> finder.nbest(bigram_measures.pmi, 10)
[('m', 'e'), ('t', 's')]

And in another file where the phrase "this is funny" is very common, I get an empty list [].

How should I go about doing this?

Here's my full code:

import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

# change this to read in your data
finder = BigramCollocationFinder.from_words('MkXVM6ad9nI.txt')

# only bigrams that appear 3+ times
finder.apply_freq_filter(3)

# return the 10 n-grams with the highest PMI
print finder.nbest(bigram_measures.pmi, 10)

Solution

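  • I haven't used nltk, but I suspect the problem is that from_words expects a sequence of word tokens, not a filename. A string passed directly is iterated character by character, which would explain letter pairs such as ('m', 'e') in your output.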

    Something akin to

    with open('MkXVM6ad9nI.txt') as wordfile:
        text = wordfile.read()
    
    tokens = nltk.wordpunct_tokenize(text)
    finder = BigramCollocationFinder.from_words(tokens)
    

    is likely to work, although there's probably a specialised API for files too.
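
To illustrate the suspicion above without relying on NLTK: a Python string is itself an iterable of characters, so pairing up adjacent items of a filename pairs letters, not words. The sketch below shows that with plain Python, then counts word n-grams by raw frequency with collections.Counter. This is a standard-library stand-in, not NLTK's collocation finder, and it ranks by count rather than PMI. The helpers `ngrams` and `most_common_phrases`, the regex tokenizer, the filename "notes.txt", and the sample text are all made up for the example.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n adjacent items as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


def most_common_phrases(text, n=3, min_count=3, top=10):
    """Rank word n-grams in text by raw count, dropping rare ones."""
    # Crude tokenizer (an assumption): lowercase runs of letters,
    # digits, and apostrophes; punctuation is discarded.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(ngrams(tokens, n))
    return [
        (" ".join(gram), count)
        for gram, count in counts.most_common(top)
        if count >= min_count
    ]


# Passing a filename string instead of tokens: the "bigrams" are letter pairs.
char_pairs = ngrams("notes.txt", 2)

# Passing real text: the repeated phrase comes out on top.
sample = "This is funny! " * 4 + "That was not funny. This is funny, really."
phrases = most_common_phrases(sample, n=3, min_count=3)
print(char_pairs[:3])
print(phrases)
```

With a real file, read its contents first (as in the snippet above) and pass the resulting text to `most_common_phrases`. Ranking by raw count tends to suit "most repeated phrase" questions, whereas PMI favours pairs that rarely occur apart, so the two can give quite different lists.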