I have a dataset that is a list of lists of IDs. Each list can have one or more unique IDs.
a = [
['123','456','789'],
['123'],
['123','456'],
['456','789']
]
I want to calculate the following value for each pair of IDs: Value(id1,id2) = # of lists with both id1,id2 / # of lists with either id1 or id2 The expected result for the example list a above is as follows:
Value('123','456') = 0.5
Value('123','789') = 0.25
Value('456','789') = 0.66
Edit: I should have been more specific from the get-go, I would actually like to get this result in matrix form based on a pre-specified ordering of IDs (for instance '123','456','789'):
Value = array([[1, 0.5, 0.25],
[0.5, 1, 0.66],
[0.25, 0.66, 1]])
The matrix will be symmetric.
The approach I've tried is making two dictionaries of
from collections import Counter
from itertools import combinations
d = Counter()
s = Counter()
for lst in a:
for id in lst:
s[id] += 1
if len(lst) < 2:
continue
lst.sort()
for comb in combinations(lst,2):
d[comb] += 1
I then created a final dictionary that applies the Value function above to each pair of IDs using the two dictionaries s and d in a for loop.
finaldict={k: d[k]/(s[k[0]]+s[k[1]]-d[k]) for k in d.keys()}
This part is taking a very long time as the number of unique IDs is very large (approx 100,000) and I wonder if there is a quicker way to do this.
Edit: Assuming I get this finaldict created, I would then have to find a way to change it into the desired matrix form.
You can use set operations for the task:
from itertools import combinations
a = [["123", "456", "789"], ["123"], ["123", "456"], ["456", "789"]]
d = {}
for i, l in enumerate(a):
for v in l:
d.setdefault(v, set()).add(i)
for n1, n2 in combinations(d, 2):
i = d[n1].intersection(d[n2])
u = d[n1].union(d[n2])
print(n1, n2, len(i) / len(u))
Prints:
123 456 0.5
123 789 0.25
456 789 0.6666666666666666
EDIT: To create a symmetrical DataFrame:
from itertools import combinations
from pandas import DataFrame
a = [["123", "456", "789"], ["123"], ["123", "456"], ["456", "789"]]
d = {}
for i, l in enumerate(a):
for v in l:
d.setdefault(v, set()).add(i)
out = []
for n1, n2 in combinations(d, 2):
i = d[n1].intersection(d[n2])
u = d[n1].union(d[n2])
out.append((n1, n2, len(i) / len(u)))
df = pd.DataFrame()
for k1, k2, v in out:
df.loc[k1, k2] = v
df.loc[k2, k1] = v
df = df.fillna(1).sort_index(axis=0).sort_index(axis=1)
print(df)
Prints:
123 456 789
123 1.00 0.500000 0.250000
456 0.50 1.000000 0.666667
789 0.25 0.666667 1.000000