Search code examples
pythonperformancecountlookupappearance

Count co-appearance of 2 variables in list of lists


I have a list of lists containg string values each (~130.000 lists, each list has ~15 items). Lists may contain duplicates but that is by design, they need to stay = I can't use a set here.

I create tuple combinations of each list values(~5.600.000 tuples) and want to count how many times each tuple value alone and together appears in the lists.

So I need to lookup for each tuple values how many times they appear in the lists. So (5.600.000 * (130.000 * 15)) which is .... a lot.

Example:

tags: [['a', 'b', 'c', 'aa', 'bb', '2019'], ['a', 'd', '18', 'gb'], ['aa', 'a', 'dd', 'fb', 'la'], ['aa', 'd', 'ddaa', 'b', 'k', 'l']]
tagSet: {('a', 'aa'), ('a', 'b'), ('b', 'd'), ('aa', 'b'), ('aa', 'd'), ('a', 'd')}

for tagTuple in tagSet:
    tagA = tagTuple[0]
    tagB = tagTuple[1]
    sumA = sum(tagA in item for item in tags )
    sumB = sum(tagB in item for item in tags )
    sumAB = ??

For tuple (a, b) the result should be

a: 3, b:2, a+b: 1

But how do I count how may times a and b appear together in each list?

A performant way is needed because of the high number of lists and tuples I need to check.


Solution

  • Create a collections.Counter for how often each of the elements in the lists in tags appear, and a dict of Counter how often each element appears together with each other element.

    from collections import Counter, defaultdict
    from itertools import combinations
    counts = Counter()
    co_counts = defaultdict(Counter)
    for lst in tags:
        c = Counter(lst)
        counts.update(c)
        for a, b in combinations(set(lst), 2):
            co_counts[a][b] += min(c[a], c[b])
            co_counts[b][a] += min(c[a], c[b])
    

    Creating those is not entirely cheap, either, but much cheaper than what you are currently doing. If your tags and tagSet have N and M elements respectively, and the lists in tags have on average have K elements (K being much smaller than N or M), then this has "only" N * K² as opposed to N * M * K.

    Then, you can just get your values directly from those Counter dictionaries.

    for a, b in tagSet:
        print(a, b, counts[a], counts[b], co_counts[a][b])
    

    This gives the following output:

    a b 3 2 1
    a aa 3 3 2
    b d 2 2 1
    a d 3 2 1
    aa d 3 2 1
    aa b 3 2 2