Consider the following problem: I want to keep elements of list1 that belongs to list2. So I can do something like this:
filtered_list = [w for w in list1 if w in list2]
I need to repeat this same procedure for different examples of list1 (about 20000 different examples) and a "constant" (frozen) list2.
How can I speed up the process?
I also know the following properties:
1) list1 has repeated elements and it is not sorted and it has about 10000 (ten thousand) items.
2) list2 is a giant sorted list (about 200000 - two hundred thousand) entries in Python) and each element is unique.
The first thing that comes to me is that maybe I can use a kind of binary search. However, is there a way to do this in Python?
Furthermore, I do not mind if filtered_list has the same order of items of list1. So, maybe I can check only a unrepeated version of list1 and after removing the elements in list1 that do not belong to list 2, I can return the repeated items.
Is there a fast way to do this in Python 3?
Convert list2
to a set
:
# do once
set2 = set(list2)
# then every time
filtered_list = [w for w in list1 if w in set2]
x in list2
is sequential; x in set2
uses the same mechanism as dictionaries, resulting in a very quick lookup.
If list1
didn't have duplicates, converting both to sets and taking set intersection would be the way to go:
filtered_set = set1 & set2
but with duplicates you're stuck with iterating over list1
as above.
(As you said, you could even see elements that you should delete, using set1 - set2
, but then you'd still be stuck in a loop in order to delete - there shouldn't be any difference in performance between filtering keepers vs filtering trash, you still have to iterate over list1
, so that's no win over the method above.)
EDIT in response to comment: Converting list1
to a Counter
would might (EDIT: or not; testing needed!) speed it up if you can use it normally like that (i.e. you never have a list, you always just deal with a Counter
). But if you have to preprocess list1
into counter1
each time you do the above operation, again it's no win - creating a Counter
will again involve a loop.