python pandas nlp

Combinations of words that co-occur most often across strings


Here's the problem...

I have a list of strings:

strings = ['one two three four', 'one two four five', 'four one two', 'three four']

I'm trying to find combinations of words that co-occur in two or more strings.

And here's the output I'm trying to get...

  • [one, two, four] - 3 times
  • [three, four] - 2 times
  • [one, two] - 3 times
  • [two, four] - 3 times

The combinations could be any length of two or more words.

Here's what I've already looked at - though I'm not having much luck finding anything I can bootstrap for my needs : (


Solution

  • You can generate every combination of two or more words from each string (a powerset restricted to a minimum size of 2) and then count how many strings each combination appears in:

    from itertools import chain, combinations
    from collections import Counter
    
    # https://docs.python.org/3/library/itertools.html
    def powerset(iterable, MIN=2):
        "powerset([1,2,3], MIN=2) --> (1,2) (1,3) (2,3) (1,2,3)"
        s = list(iterable)
        return chain.from_iterable(combinations(s, r) for r in range(MIN, len(s)+1))
    
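    # set() drops combinations repeated within a single string,
    # so each combination is counted at most once per string
    c = Counter(chain.from_iterable(set(powerset(s.split()))
                for s in strings))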
    
    # keep counts of 2 or more
    out = {k: v for k, v in c.items() if v >= 2}
    

    Output:

    {('three', 'four'): 2, 
     ('two', 'four'): 2, 
     ('one', 'two', 'four'): 2, 
     ('one', 'four'): 2, 
     ('one', 'two'): 3}
    

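    To keep the combinations in the order they first appear, use tuple instead of set. Without set, a combination that occurs more than once within a single string (because of a repeated word) is counted each time: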

    c = Counter(chain.from_iterable(tuple(powerset(s.split()))
                for s in strings))
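
Both variants above build combinations in the order the words appear in each string, so 'four one two' contributes ('four', 'one', 'two') rather than ('one', 'two', 'four'). That is why the triple is counted 2 times here instead of the 3 times the question expects. If word order should not matter, one option is to sort the unique words of each string before building the combinations. The following is a minimal sketch under that assumption; count_cooccurrences, min_size and min_count are illustrative names, not part of the original answer:

```python
from collections import Counter
from itertools import chain, combinations


def powerset(iterable, min_size=2):
    """Return all combinations containing at least min_size items."""
    items = list(iterable)
    return chain.from_iterable(
        combinations(items, r) for r in range(min_size, len(items) + 1)
    )


def count_cooccurrences(strings, min_count=2):
    """Count word combinations regardless of word order (illustrative helper)."""
    counts = Counter(
        chain.from_iterable(
            # sorted(set(...)) gives every string the same canonical word
            # order and drops repeated words, so each combination is
            # counted at most once per string
            powerset(sorted(set(s.split())))
            for s in strings
        )
    )
    # keep only combinations found in at least min_count strings
    return {combo: n for combo, n in counts.items() if n >= min_count}


strings = ['one two three four', 'one two four five', 'four one two', 'three four']
out = count_cooccurrences(strings)
```

The keys are alphabetically sorted tuples, so the question's [one, two, four] shows up as ('four', 'one', 'two') with a count of 3, and [three, four] as ('four', 'three') with a count of 2. The number of combinations grows exponentially with the number of distinct words in a string, so this approach is only practical for short strings.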