
Count ngram word frequency using text collocations


I would like to count how often each sequence of three words precedes or follows a specific word in a text file that has been tokenized.

import nltk
from collections import Counter
with open('dracula.txt', 'r', encoding="ISO-8859-1") as textfile:
    text_data = textfile.read().replace('\n', ' ').lower()
tokens = nltk.word_tokenize(text_data)
text = nltk.Text(tokens)
grams = nltk.ngrams(tokens, 4)
freq = Counter(grams)
freq.most_common(20)

I don't know how to filter these n-grams so that only those containing the word 'dracula' are kept. I also tried:

text.collocations(num=100)
text.concordance('dracula')

The desired output would look something like this, with counts:

Three words preceding 'dracula', sorted by count

(('and', 'he', 'saw', 'dracula'), 4),
(('one', 'cannot', 'see', 'dracula'), 2)

Three words following 'dracula', sorted by count

(('dracula', 'and', 'he', 'saw'), 4),
(('dracula', 'one', 'cannot', 'see'), 2)

The trigram containing 'dracula' in the middle, sorted by count

(('count', 'dracula', 'saw'), 4),
(('count', 'dracula', 'cannot'), 2)

Thank you in advance for any help.


Solution

  • Once you have the frequency counts keyed by n-gram tuples, as you've done, you can keep only the n-grams that have your word at a given position by filtering on that position. Python's list comprehension syntax lets you do this with an if condition:

    from collections import Counter

    import nltk

    # pulled text from here: https://archive.org/details/draculabr00stokuoft/page/n6
    with open('dracula.txt', 'r', encoding="ISO-8859-1") as textfile:
        text_data = textfile.read().replace('\n', ' ').lower()

    tokens = nltk.word_tokenize(text_data)
    grams = nltk.ngrams(tokens, 4)  # every window of 4 consecutive tokens
    freq = Counter(grams)           # maps each 4-gram tuple to its count

    # most_common() returns (ngram, count) pairs sorted by count,
    # so item[0] is the 4-gram tuple and item[0][i] is its i-th word
    dracula_last = [item for item in freq.most_common() if item[0][3] == 'dracula']
    dracula_first = [item for item in freq.most_common() if item[0][0] == 'dracula']
    dracula_second = [item for item in freq.most_common() if item[0][1] == 'dracula']
    # etc.
    

    This produces lists with "dracula" in different positions. Here is what dracula_last looks like:

    [(('the', 'castle', 'of', 'dracula'), 3),
     (("'s", 'journal', '243', 'dracula'), 1),
     (('carpathian', 'moun-', '2', 'dracula'), 1),
     (('of', 'the', 'castle', 'dracula'), 1),
     (('named', 'by', 'count', 'dracula'), 1),
     (('disease', '.', 'count', 'dracula'), 1),
     ...]
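
The same position filter can be wrapped in a small helper that covers all three outputs asked for in the question, including the trigram with the word in the middle. This is only a minimal sketch: the helper name `ngrams_with_word` and the token list are made up for illustration, and the windows are built with plain `zip` so the example runs without NLTK or the text file (`nltk.ngrams(tokens, n)` could be used in its place).

```python
from collections import Counter


def ngrams_with_word(tokens, word, n, position):
    """Count the n-grams of length `n` whose token at index `position` is `word`."""
    # zip over n staggered slices yields every consecutive n-token window,
    # the same windows nltk.ngrams(tokens, n) produces without padding.
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(gram for gram in windows if gram[position] == word)


# Made-up tokens for illustration only; in practice use the tokenized novel.
tokens = "and he saw dracula and he saw count dracula saw him".split()

preceding = ngrams_with_word(tokens, "dracula", n=4, position=3)  # 3 words before
following = ngrams_with_word(tokens, "dracula", n=4, position=0)  # 3 words after
middle = ngrams_with_word(tokens, "dracula", n=3, position=1)     # word in the middle

for name, counts in [("preceding", preceding), ("following", following), ("middle", middle)]:
    print(name, counts.most_common())
```

Calling `most_common()` on each result gives the (n-gram, count) pairs sorted by count, in the shape shown in the question. Filtering before counting also means the Counter only ever holds n-grams that contain the target word.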