I have a list of lists containg string values each (~130.000 lists, each list has ~15 items). Lists may contain duplicates but that is by design, they need to stay = I can't use a set here.
I create tuple combinations of each list values(~5.600.000 tuples) and want to count how many times each tuple value alone and together appears in the lists.
So I need to lookup for each tuple values how many times they appear in the lists. So (5.600.000 * (130.000 * 15)) which is .... a lot.
Example:
tags: [['a', 'b', 'c', 'aa', 'bb', '2019'], ['a', 'd', '18', 'gb'], ['aa', 'a', 'dd', 'fb', 'la'], ['aa', 'd', 'ddaa', 'b', 'k', 'l']]
tagSet: {('a', 'aa'), ('a', 'b'), ('b', 'd'), ('aa', 'b'), ('aa', 'd'), ('a', 'd')}
for tagTuple in tagSet:
tagA = tagTuple[0]
tagB = tagTuple[1]
sumA = sum(tagA in item for item in tags )
sumB = sum(tagB in item for item in tags )
sumAB = ??
For tuple (a, b) the result should be
a: 3, b:2, a+b: 1
But how do I count how may times a and b appear together in each list?
A performant way is needed because of the high number of lists and tuples I need to check.
Create a collections.Counter
for how often each of the elements in the lists in tags
appear, and a dict
of Counter
how often each element appears together with each other element.
from collections import Counter, defaultdict
from itertools import combinations
counts = Counter()
co_counts = defaultdict(Counter)
for lst in tags:
c = Counter(lst)
counts.update(c)
for a, b in combinations(set(lst), 2):
co_counts[a][b] += min(c[a], c[b])
co_counts[b][a] += min(c[a], c[b])
Creating those is not entirely cheap, either, but much cheaper than what you are currently doing. If your tags
and tagSet
have N
and M
elements respectively, and the lists in tags
have on average have K
elements (K
being much smaller than N
or M
), then this has "only" N * K²
as opposed to N * M * K
.
Then, you can just get your values directly from those Counter
dictionaries.
for a, b in tagSet:
print(a, b, counts[a], counts[b], co_counts[a][b])
This gives the following output:
a b 3 2 1
a aa 3 3 2
b d 2 2 1
a d 3 2 1
aa d 3 2 1
aa b 3 2 2