Search code examples
pythondictionarycombinationscounterpython-itertools

Finding the proportion of occurrences of pairs in a list of lists


I have a dataset that is a list of lists of IDs. Each list can have one or more unique IDs.

a = [
['123','456','789'],
['123'],
['123','456'],
['456','789']
]

I want to calculate the following value for each pair of IDs: Value(id1,id2) = # of lists with both id1,id2 / # of lists with either id1 or id2 The expected result for the example list a above is as follows:

Value('123','456') = 0.5
Value('123','789') = 0.25
Value('456','789') = 0.66

Edit: I should have been more specific from the get-go, I would actually like to get this result in matrix form based on a pre-specified ordering of IDs (for instance '123','456','789'):

Value = array([[1, 0.5, 0.25],
               [0.5, 1, 0.66],
               [0.25, 0.66, 1]])

The matrix will be symmetric.

The approach I've tried is making two dictionaries of

  1. counts for each pair of IDs as described by the top answer in this post: Finding the most frequent occurrences of pairs in a list of lists.
  2. counts for each individual ID
from collections import Counter
from itertools import combinations
d = Counter()
s = Counter()
for lst in a:
    for id in lst:
        s[id] += 1
    if len(lst) < 2:
        continue
    lst.sort()
    for comb in combinations(lst,2):
        d[comb] += 1

I then created a final dictionary that applies the Value function above to each pair of IDs using the two dictionaries s and d in a for loop.

finaldict={k: d[k]/(s[k[0]]+s[k[1]]-d[k]) for k in d.keys()}

This part is taking a very long time as the number of unique IDs is very large (approx 100,000) and I wonder if there is a quicker way to do this.

Edit: Assuming I get this finaldict created, I would then have to find a way to change it into the desired matrix form.


Solution

  • You can use set operations for the task:

    from itertools import combinations
    
    a = [["123", "456", "789"], ["123"], ["123", "456"], ["456", "789"]]
    
    d = {}
    for i, l in enumerate(a):
        for v in l:
            d.setdefault(v, set()).add(i)
    
    for n1, n2 in combinations(d, 2):
        i = d[n1].intersection(d[n2])
        u = d[n1].union(d[n2])
    
        print(n1, n2, len(i) / len(u))
    

    Prints:

    123 456 0.5
    123 789 0.25
    456 789 0.6666666666666666
    

    EDIT: To create a symmetrical DataFrame:

    from itertools import combinations
    
    from pandas import DataFrame
    
    a = [["123", "456", "789"], ["123"], ["123", "456"], ["456", "789"]]
    
    d = {}
    for i, l in enumerate(a):
        for v in l:
            d.setdefault(v, set()).add(i)
    
    out = []
    for n1, n2 in combinations(d, 2):
        i = d[n1].intersection(d[n2])
        u = d[n1].union(d[n2])
        out.append((n1, n2, len(i) / len(u)))
    
    df = pd.DataFrame()
    for k1, k2, v in out:
        df.loc[k1, k2] = v
        df.loc[k2, k1] = v
    
    df = df.fillna(1).sort_index(axis=0).sort_index(axis=1)
    print(df)
    

    Prints:

          123       456       789
    123  1.00  0.500000  0.250000
    456  0.50  1.000000  0.666667
    789  0.25  0.666667  1.000000