Tags: python, regex, n-gram

Counting bigrams (pairs of two words) in a file using Python


I want to count the number of occurrences of all bigrams (pairs of adjacent words) in a file using Python. Since I am dealing with very large files, I am looking for an efficient way. I tried using the count method with the regex "\w+\s\w+" on the file contents, but it did not prove to be efficient.

e.g. Let's say I want to count the number of bigrams from a file a.txt, which has the following content:

"the quick person did not realize his speed and the quick person bumped "

For the above file, the bigram set and their counts will be:

(the, quick) = 2
(quick, person) = 2
(person, did) = 1
(did, not) = 1
(not, realize) = 1
(realize, his) = 1
(his, speed) = 1
(speed, and) = 1
(and, the) = 1
(person, bumped) = 1

I have come across an example of Counter objects in Python, which is used to count unigrams (single words). It also uses a regex approach.

The example goes like this:

>>> # Find the ten most common words in Hamlet
>>> import re
>>> from collections import Counter
>>> words = re.findall(r'\w+', open('a.txt').read())
>>> print(Counter(words).most_common(10))

The output of the above code is:

[('the', 2), ('quick', 2), ('person', 2), ('did', 1), ('not', 1),
 ('realize', 1), ('his', 1), ('speed', 1), ('and', 1), ('bumped', 1)]

I was wondering if it is possible to use the Counter object to get counts of bigrams. Any approach other than a Counter object or regex would also be appreciated.


Solution

  • Some itertools magic:

    >>> import re
    >>> from collections import Counter
    >>> from itertools import islice
    >>> words = re.findall(r"\w+",
    ...     "the quick person did not realize his speed and the quick person bumped")
    >>> print(Counter(zip(words, islice(words, 1, None))))
    

    Output:

    Counter({('the', 'quick'): 2, ('quick', 'person'): 2, ('person', 'did'): 1, 
      ('did', 'not'): 1, ('not', 'realize'): 1, ('and', 'the'): 1, 
      ('speed', 'and'): 1, ('person', 'bumped'): 1, ('his', 'speed'): 1, 
      ('realize', 'his'): 1})
    

    Bonus

    Get the frequency of any n-gram:

    from itertools import tee, islice

    def ngrams(lst, n):
        # Slide a window of size n over the iterable, yielding each n-gram
        # as a tuple. tee() lets us read n items ahead without losing our place.
        tlst = lst
        while True:
            a, b = tee(tlst)
            l = tuple(islice(a, n))
            if len(l) == n:
                # A full n-gram is available: yield it, then drop one word
                # from the front of the copy to advance the window.
                yield l
                next(b)
                tlst = b
            else:
                break
    
    >>> Counter(ngrams(words, 3))
    

    Output:

    Counter({('the', 'quick', 'person'): 2, ('and', 'the', 'quick'): 1, 
      ('realize', 'his', 'speed'): 1, ('his', 'speed', 'and'): 1, 
      ('person', 'did', 'not'): 1, ('quick', 'person', 'did'): 1, 
      ('quick', 'person', 'bumped'): 1, ('did', 'not', 'realize'): 1, 
      ('speed', 'and', 'the'): 1, ('not', 'realize', 'his'): 1})
    

    This works with lazy iterables and generators too, so you can write a generator that reads a file line by line and yields its words one at a time, then pass it to ngrams to consume lazily, without ever reading the whole file into memory.
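
    For example, here is a minimal sketch of that idea. It reuses the ngrams generator defined above; the helper name words_from_file and the file name a.txt are only illustrative:

    import re
    from collections import Counter

    def words_from_file(path):
        # Lazily yield words one at a time, reading the file line by line,
        # so the whole file never has to be held in memory.
        with open(path) as f:
            for line in f:
                for word in re.findall(r"\w+", line):
                    yield word

    # ngrams() is the generator defined above; n=2 gives bigrams.
    bigram_counts = Counter(ngrams(words_from_file('a.txt'), 2))
    print(bigram_counts.most_common(10))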