Search code examples
python-3.xlist-comprehensionbinary-search

Python: Faster way to filter a list using list comprehension


Consider the following problem: I want to keep elements of list1 that belongs to list2. So I can do something like this:

filtered_list = [w for w in list1 if w in list2]

I need to repeat this same procedure for different examples of list1 (about 20000 different examples) and a "constant" (frozen) list2.

How can I speed up the process?

I also know the following properties:

1) list1 has repeated elements and it is not sorted and it has about 10000 (ten thousand) items.

2) list2 is a giant sorted list (about 200000 - two hundred thousand) entries in Python) and each element is unique.

The first thing that comes to me is that maybe I can use a kind of binary search. However, is there a way to do this in Python?

Furthermore, I do not mind if filtered_list has the same order of items of list1. So, maybe I can check only a unrepeated version of list1 and after removing the elements in list1 that do not belong to list 2, I can return the repeated items.

Is there a fast way to do this in Python 3?


Solution

  • Convert list2 to a set:

    # do once
    set2 = set(list2)
    
    # then every time
    filtered_list = [w for w in list1 if w in set2]
    

    x in list2 is sequential; x in set2 uses the same mechanism as dictionaries, resulting in a very quick lookup.

    If list1 didn't have duplicates, converting both to sets and taking set intersection would be the way to go:

    filtered_set = set1 & set2
    

    but with duplicates you're stuck with iterating over list1 as above.

    (As you said, you could even see elements that you should delete, using set1 - set2, but then you'd still be stuck in a loop in order to delete - there shouldn't be any difference in performance between filtering keepers vs filtering trash, you still have to iterate over list1, so that's no win over the method above.)

    EDIT in response to comment: Converting list1 to a Counter would might (EDIT: or not; testing needed!) speed it up if you can use it normally like that (i.e. you never have a list, you always just deal with a Counter). But if you have to preprocess list1 into counter1 each time you do the above operation, again it's no win - creating a Counter will again involve a loop.