Tags: python, list, combinations, counter, data-analysis

Find frequency of relationships of tags in lists (pairwise correlation?)


I have some lists of tags for images. I want to find out which tags seem to be related:

l1 = ["cat", "toe", "man"]
l2 = ["cat", "toe", "ice"]
l3 = ["cat", "hat", "bed"]

In this simple example, "cat" and "toe" seem related because they appear together in two lists (l1 and l2).

How can this be computed, with a result like cat & toe: 2? I suspect that I am asking for "pairwise correlation", but the resources I have found on that kind of analysis are too complicated for me.


Solution

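  • You can use collections.defaultdict with frozenset and itertools.combinations to build a dictionary of pairwise counts. A frozenset is hashable and ignores order, so ("cat", "toe") and ("toe", "cat") are counted under the same key.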

    Variations are possible. For example, you can use collections.Counter with sorted tuples instead, but the idea is fundamentally the same.

    from collections import defaultdict
    from itertools import combinations
    
    dd = defaultdict(int)
    
    L1 = ["cat", "toe", "man"]
    L2 = ["cat", "toe", "ice"]
    L3 = ["cat", "hat", "bed"]
    
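    for L in [L1, L2, L3]:
        # combinations(L, 2) yields every 2-tag pair within one list;
        # frozenset makes each pair order-independent before counting.
        for pair in map(frozenset, combinations(L, 2)):
            dd[pair] += 1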
    

    Result:

    defaultdict(int,
                {frozenset({'cat', 'toe'}): 2,
                 frozenset({'cat', 'man'}): 1,
                 frozenset({'man', 'toe'}): 1,
                 frozenset({'cat', 'ice'}): 1,
                 frozenset({'ice', 'toe'}): 1,
                 frozenset({'cat', 'hat'}): 1,
                 frozenset({'bed', 'cat'}): 1,
                 frozenset({'bed', 'hat'}): 1})
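
The collections.Counter variation mentioned above could look roughly like the sketch below. The names tag_lists and pair_counts are made up for illustration. The set() call is an added assumption, not part of the original answer: it stops a tag that is repeated within one list from being paired with itself.

```python
from collections import Counter
from itertools import combinations

# Same sample data as above, gathered into one list (hypothetical name).
tag_lists = [
    ["cat", "toe", "man"],
    ["cat", "toe", "ice"],
    ["cat", "hat", "bed"],
]

pair_counts = Counter()
for tags in tag_lists:
    # set() drops tags repeated within one list (an assumption about the data);
    # sorted() puts each pair in a canonical order, so ("cat", "toe") and
    # ("toe", "cat") end up under the same tuple key.
    unique_tags = sorted(set(tags))
    pair_counts.update(combinations(unique_tags, 2))

# most_common() lists pairs from most to least frequent.
for (a, b), count in pair_counts.most_common(3):
    print(f"{a} & {b}: {count}")
```

The first line printed is cat & toe: 2. The main practical difference from the defaultdict version is that Counter.most_common gives you the ranking without a separate sort.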