python nlp nltk n-gram

Computing N Grams using Python


I need to compute the unigrams, bigrams and trigrams for a text file containing text like:

"Cystic fibrosis affects 30,000 children and young adults in the US alone Inhaling the mists of salt water can reduce the pus and infection that fills the airways of cystic fibrosis sufferers, although side effects include a nasty coughing fit and a harsh taste. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."

I started in Python and used the following code:

#!/usr/bin/env python
# File: n-gram.py
def N_Gram(N, text):
    NList = []                      # start with an empty list
    if N > 1:
        space = " " * (N-1)         # add N - 1 spaces
        text = space + text + space # add both in front and back
    # append the slices [i:i+N] to NList
    for i in range(len(text) - (N - 1)):
        NList.append(text[i:i+N])
    return NList                    # return the list

# test code
for i in range(5):
    print(N_Gram(i+1, "text"))
# more test code
nList = N_Gram(7, "Here is a lot of text to print")
for ngram in nList:
    print('"' + ngram + '"')

http://www.daniweb.com/software-development/python/threads/39109/generating-n-grams-from-a-word

But this produces character-level n-grams within a word, whereas I want word-level n-grams, such as CYSTIC and FIBROSIS (unigrams) or CYSTIC FIBROSIS (a bigram). Can someone help me get this done?


Solution

  • Assuming the input is a string of space-separated words, like x = "a b c d", you can use the following function (edit: see the last function for a possibly more complete solution):

    def ngrams(input, n):
        input = input.split(' ')
        output = []
        for i in range(len(input)-n+1):
            output.append(input[i:i+n])
        return output
    
    ngrams('a b c d', 2) # [['a', 'b'], ['b', 'c'], ['c', 'd']]
    

    If you want those joined back into strings, you might call something like:

    [' '.join(x) for x in ngrams('a b c d', 2)] # ['a b', 'b c', 'c d']
    

    Lastly, that doesn't count how often each n-gram occurs. If your input is 'a a a a', the same bigram appears three times, so you need to tally the results in a dict:

    grams = {}  # maps each n-gram string to its count
    for g in (' '.join(x) for x in ngrams('a a a a', 2)):
        grams.setdefault(g, 0)
        grams[g] += 1
    # grams is now {'a a': 3}
    

    Putting that all together into one final function gives:

    def ngrams(input, n):
        input = input.split(' ')
        output = {}
        for i in range(len(input)-n+1):
            g = ' '.join(input[i:i+n])
            output.setdefault(g, 0)
            output[g] += 1
        return output
    
    ngrams('a a a a', 2) # {'a a': 3}
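
As a possible variation on the final function above, here is a minimal sketch that does the same counting with collections.Counter from the standard library and splits on any whitespace, which may be more convenient when the text comes from a file with newlines. The name word_ngram_counts and the sample string are made up for illustration and are not part of the original answer. Note the assumptions: it does not lowercase the text or strip punctuation, so "fibrosis" and "fibrosis," would be counted as different words.

```python
from collections import Counter


def word_ngram_counts(text, n):
    """Count word-level n-grams in text (hypothetical helper for illustration)."""
    # split() with no argument splits on any run of whitespace,
    # so newlines and repeated spaces do not produce empty "words".
    words = text.split()
    # Slide a window of n words across the list and join each window.
    grams = (' '.join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)


# Made-up sample, loosely based on the text in the question.
sample = "cystic fibrosis affects children and cystic fibrosis sufferers"
unigrams = word_ngram_counts(sample, 1)
bigrams = word_ngram_counts(sample, 2)
trigrams = word_ngram_counts(sample, 3)

print(bigrams.most_common(1))  # [('cystic fibrosis', 2)]
```

Counter is a dict subclass, so the result can be used just like the dict returned by ngrams, with most_common() available as a shortcut for the most frequent n-grams.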