Tags: python, python-2.7, pyspark, n-gram, python-collections

Most common 2-grams using Python


Given a string:

this is a test this is

How can I find the top-n most common 2-grams? In the string above, all 2-grams are:

{this is, is a, a test, test this, this is}

As you can see, the 2-gram "this is" appears twice, so the result should be:

{this is: 2}

I know I can use the Counter.most_common() method to find the most common elements, but how can I create a list of 2-grams from the string to begin with?


Solution

  • You can use the method provided in this blog post to conveniently create n-grams in Python.

    from collections import Counter
    
    # `words` is a list of tokens; pair each word with its successor
    bigrams = zip(words, words[1:])
    counts = Counter(bigrams)
    print(counts.most_common())
    

    That assumes the input is a list of words, of course. If your input is a string like the one you provided (which has no punctuation), then words = text.split() is enough to get a list of words; calling split() with no argument avoids empty strings when words are separated by more than one space. In general, though, you have to take punctuation, whitespace and other non-alphabetic characters into account. In that case you might do something like

    import re
    
    words = re.findall(r'[A-Za-z]+', text)
    

    or you could use an external library such as nltk.tokenize.

    Edit. If you need tri-grams or any other n-grams in general, then you can use the function provided in the blog post I linked to:

    def find_ngrams(input_list, n):
        # zip n staggered slices of the list; each resulting tuple is one n-gram
        return zip(*(input_list[i:] for i in range(n)))
    
    trigrams = find_ngrams(words, 3)
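
Putting the pieces above together, a minimal end-to-end sketch might look like the following. The helper name top_ngrams, its parameters n and k, and the lowercasing step are assumptions added for illustration and are not part of the original answer; the tokenizer is the same regex used earlier.

```python
import re
from collections import Counter


def find_ngrams(words, n):
    # Same zip-based idea as above: n staggered slices of the token list.
    return zip(*(words[i:] for i in range(n)))


def top_ngrams(text, n=2, k=1):
    # Hypothetical helper: tokenize, build n-grams, return the k most common.
    # Lowercasing is an assumption so that "This is" and "this is" match.
    words = re.findall(r"[A-Za-z]+", text.lower())
    counts = Counter(" ".join(gram) for gram in find_ngrams(words, n))
    return counts.most_common(k)


print(top_ngrams("this is a test this is", n=2, k=1))
```

With the sample string from the question this prints [('this is', 2)]. On Python 3, zip returns an iterator rather than a list, which Counter and the generator expression consume without any changes.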