Search code examples
pythonprocessing-efficiencycomputation

Find intersection of large number of lists in python


I have a file containing space separated numbers in each line. Each line corresponds to a list of numbers.
Now there are around 300,000 such lines (each line containing around 100 numbers on an average).
I want to find mutual intersection of all such lists i.e. first list intersected with all other lists, then second list intersected with all other lists and so on.
I am using

set(a) & set(b)  

where a and b are lists I get iterating in a double loop.
But this is taking too much time. For ex : for first list intersected with all other lists, it took about 3 minutes.
How can I do this efficiently? (may be with some other language/tool)


Solution

  • You should use generator expressions here, they do lazy evaluation and save a lot of memory:

    In [46]: from itertools import imap
    
    In [47]: a = [[1,2,3], [2,3,4], [3,4,5]]
    
    In [48]: reduce(set.intersection,imap(set,a))
    Out[48]: set([3])
    

    considering your file looks like this:

    1 2 3
    2 3 4
    3 4 5
    

    code: Using itertools.combinations():

    with open("abc.txt") as f:
        lines=(map(int,x.split()) for x in f)
        for x in combinations(lines,2):
            print x,'-->',reduce(set.intersection,imap(set,x))
       ....:         
    ([1, 2, 3], [2, 3, 4]) --> set([2, 3])
    ([1, 2, 3], [3, 4, 5]) --> set([3])
    ([2, 3, 4], [3, 4, 5]) --> set([3, 4])