Here's the problem...
I have a list of strings:
strings = ['one two three four', 'one two four five', 'four one two', 'three four']
I'm trying to find combinations of words that co-occur in two or more strings.
And here's the output I'm trying to get...
The combinations could be any length of two or more words.
Here's what I've already looked at - though I'm not having much luck finding anything I can bootstrap for my needs : (
You can generate, for each string, every combination of two or more words (a powerset with a minimum size of 2) and then count how many strings each combination appears in:
from itertools import chain, combinations
from collections import Counter

# adapted from the powerset recipe at https://docs.python.org/3/library/itertools.html
def powerset(iterable, MIN=2):
    "powerset([1,2,3]) --> (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(MIN, len(s)+1))

# set() so that each combination is counted at most once per string
c = Counter(chain.from_iterable(set(powerset(s.split()))
                                for s in strings))

# keep counts of 2 or more
out = {k: v for k, v in c.items() if v >= 2}
Output:
{('three', 'four'): 2,
 ('two', 'four'): 2,
 ('one', 'two', 'four'): 2,
 ('one', 'four'): 2,
 ('one', 'two'): 3}
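Note that combinations() keeps the word order of each string, so 'four one two' contributes ('four', 'one') rather than ('one', 'four'). That is why ('one', 'four') is counted only twice above even though three strings contain both words. If word order should not matter, one option is to sort the distinct words of each string before building the combinations. A minimal sketch, assuming order-insensitive matching is what you want; the helper name count_cooccurrences and its parameters are made up for illustration:

```python
from collections import Counter
from itertools import chain, combinations


def count_cooccurrences(strings, min_size=2, min_count=2):
    """Count word combinations regardless of word order (hypothetical helper)."""
    counts = Counter()
    for s in strings:
        # Sorting the distinct words gives every string the same canonical order.
        words = sorted(set(s.split()))
        counts.update(
            chain.from_iterable(
                combinations(words, r) for r in range(min_size, len(words) + 1)
            )
        )
    # Keep only combinations found in at least min_count strings.
    return {k: v for k, v in counts.items() if v >= min_count}


strings = ['one two three four', 'one two four five', 'four one two', 'three four']
result = count_cooccurrences(strings)
```

With the sample list, ('four', 'one') and ('four', 'one', 'two') are then each counted for three strings, because every tuple is stored in alphabetical order.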
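To count every occurrence instead, including a combination that appears more than once within the same string because of a repeated word, use tuple in place of set:
c = Counter(chain.from_iterable(tuple(powerset(s.split()))
                                for s in strings))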
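The set and tuple variants only differ when a string repeats a word. A small sketch to illustrate, using a made-up input (the list sample and the names per_string and per_occurrence are not from the question):

```python
from collections import Counter
from itertools import chain, combinations


def powerset(iterable, MIN=2):
    "Combinations of at least MIN items, in input order."
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(MIN, len(s) + 1))


sample = ['one one two', 'one two']  # made-up input with a repeated word

# set(): each distinct combination counts once per string
per_string = Counter(
    chain.from_iterable(set(powerset(s.split())) for s in sample)
)

# tuple(): every occurrence counts, including repeats within one string
per_occurrence = Counter(
    chain.from_iterable(tuple(powerset(s.split())) for s in sample)
)

print(per_string[('one', 'two')], per_occurrence[('one', 'two')])
```

Here 'one one two' yields ('one', 'two') twice, so the tuple variant counts it three times overall while the set variant counts it once per string, twice in total.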