I have some lists of tags for images. I want to find out which tags seem to be related:
l1 = ["cat", "toe", "man"]
l2 = ["cat", "toe", "ice"]
l3 = ["cat", "hat", "bed"]
In this simple example, "cat" and "toe" are obviously related, because they appear together twice (in l1 and l2).
How can this be computed, with a result like: cat & toe: 2? I have a hunch that I am asking for "pairwise correlation", but the resources on that kind of analysis are too complicated for me.
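You can use collections.defaultdict with frozenset and itertools.combinations to form a dictionary of pairwise counts. A frozenset is hashable and ignores order, so ("cat", "toe") and ("toe", "cat") are counted as the same pair.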
Variations are possible. For example, you can use collections.Counter with sorted tuples instead, but the idea is fundamentally the same.
from collections import defaultdict
from itertools import combinations
dd = defaultdict(int)
L1 = ["cat", "toe", "man"]
L2 = ["cat", "toe", "ice"]
L3 = ["cat", "hat", "bed"]
for L in [L1, L2, L3]:
    for pair in map(frozenset, combinations(L, 2)):
        dd[pair] += 1
Result:
defaultdict(int,
            {frozenset({'cat', 'toe'}): 2,
             frozenset({'cat', 'man'}): 1,
             frozenset({'man', 'toe'}): 1,
             frozenset({'cat', 'ice'}): 1,
             frozenset({'ice', 'toe'}): 1,
             frozenset({'cat', 'hat'}): 1,
             frozenset({'bed', 'cat'}): 1,
             frozenset({'bed', 'hat'}): 1})
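For completeness, here is a minimal sketch of the collections.Counter variation mentioned above, using sorted tuples as keys. The names tag_lists and pair_counts are my own, and I am assuming a tag should count at most once per list, so each list is deduplicated with set() first.

```python
from collections import Counter
from itertools import combinations

# Hypothetical container for the three example lists from the question.
tag_lists = [
    ["cat", "toe", "man"],
    ["cat", "toe", "ice"],
    ["cat", "hat", "bed"],
]

pair_counts = Counter()
for tags in tag_lists:
    # set() drops repeated tags within one list (an assumption on my part);
    # sorted() gives every pair a canonical order, so ("cat", "toe") is
    # never stored separately as ("toe", "cat").
    unique_tags = sorted(set(tags))
    pair_counts.update(combinations(unique_tags, 2))

# Print the most frequent pair in the format asked for in the question.
for (a, b), n in pair_counts.most_common(1):
    print(f"{a} & {b}: {n}")  # prints "cat & toe: 2"
```

Counter.most_common(n) returns the n most frequent pairs, which is handy once you have many lists and only care about the strongest co-occurrences.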