Search code examples
pythonintegercomparepython-itertools

Python 3: Efficient way to loop through and compare integer lists?


I'm trying to compare two huge lists which contain 10,000+ lists integers. Each sub-list contains 20 integers, which are random between 1 and 99. Within the sub-lists all integers are unique.

list1 = [[1, 25, 23, 44, ...], [3, 85, 9, 24, 34, ...], ...]
list2 = [[3, 83, 45, 24, ...], [9, 82, 3, 47, 36, ...], ...]
result = compare_lists(list1, list2)

The compare_lists() function would compare integer from two lists that are in the same position, and return the two lists if the integers are different.

It is obviously very inefficient to loop through each sub-list as there are 100 Million+ possible combinations. (each of the 10,000+ sub-lists in list1 gets compared to 10,000+ in list2)

import itertools
def compare_lists(list1, list2):
    for (a, b) in itertools.product(list1, list2):
        count = 0
        for z in range(20):
            if a[z] != b[z]:
                count += 1
        if count == 20:
            yield [a, b]

For example (i'll use 4 integers per list):

a = [1, 2, 3, 4] # True
b = [5, 6, 7, 8] # (integers are different)

a = [1, 2, 3, 4] # True
b = [2, 3, 4, 1] # (same integers but not in same position, still true)

a = [1, 2, 3, 4] # False
b = [1, 6, 7, 8] # (position [0] is identical)

itertools.product appears to be very inefficient in situations like this. Is there a faster or more efficient way to do this?

Sorry if this is unclear, I've only recently started using Python.


Solution

  • I don't know how to reduce the number of list-list comparisons based on some precomputed data in general.

    Maybe you can get some advantage if the dataset has some property. For example, if you know that the vast majority of the possible 100M+ pairs will be in your output, I would focus on finding the small minority of rejected pairs. If value V appears on position P in a sublist, you can categorize the data in such way that every sublist belongs to 20 categories (P,V) from roughly 2K possibilities (20 positions * 99 values). Two sublist compare False it they share a category. This way you could build in few steps a set of (i,j) pairs such that list1[i] compares False with list2[j]. The output is than everything else from the carthesian product of possible indices i,j.


    BTW, you can make the comparison a little bit more efficient than it currently is.

    One matching pair a[z] == b[z] is enough to know the result is False.

        for z in range(20):
            if a[z] == b[z]:
                break
        else:
            yield [a, b]
    

    or equivalent:

        if all(i != j for i,j in zip(a,b)):
            yield [a, b]
    

    I did not run timing test which one is faster. Anyway the speedup is probably marginal.