Tags: python, nested-for-loop

Are nested for loops always slow?


It seems there are quite a few questions and answers related to the speed of nested for loops - I think I looked at every single one of them! But unfortunately I am still not exactly sure why my code is slow. I'm hoping I can get some guidance from you fine people.

I download a csv file daily that has ~116,000 entries. Items are added and taken away from it at inconsistent points in the file, so every day I want to see what was added and what was taken away.

Getting the entries from csv to a list takes no time at all, for both the old and new list, but I encounter a big speed decrease in the next part of the code, although at the end, it does what I want and spits out the difference - items added and items removed.

Each of the 116,000 items in the list is a dictionary like so:

old or new = [{'Date Stamped': '', 'Name': '', 'Registration Number': '', 'Type': '', 'Form Name': '', 'URL': ''}]

when I get to this point:

added = [i for i in new if not i in old]
removed = [i for i in old if not i in new]

It takes 25 minutes to finish! I feel like this is a long time, but also I may not be understanding exactly what I'm doing.

Each list (old and new) has ~116,000 items in it. Is that because i has to iterate through ~116,000 items 4 times?

It does what I want in the end, but it seems awfully slow for what it's doing; that said, this is really the first time I've worked with a data set with this many items, so maybe it's par for the course.

Is this slow because it is a nested for loop? Is it slow because of the size? I am definitely an amateur and really appreciate everyone's help. Thanks so much.


Solution

  • Effectively, yes - it's slow both because it's a nested loop (each in test is itself a loop over the other list) and because of the size.

    Python's element in list operation works by scanning the entire list, element by element, until it finds the one it wants. Doing that for every single element in new means you potentially search all of old for each element of new - roughly 116,000 × 116,000 ≈ 13 billion comparisons in the worst case.
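    To see that scan cost directly, here's a quick illustration with timeit (the list size and repeat count are arbitrary, just for demonstration):

```python
import timeit

data = list(range(100_000))

# Worst case: the element is at the very end, so the whole list is scanned.
t_last = timeit.timeit(lambda: 99_999 in data, number=100)

# Best case: the element is first, so the scan stops immediately.
t_first = timeit.timeit(lambda: 0 in data, number=100)

print(f"last element: {t_last:.4f}s, first element: {t_first:.6f}s")
```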

    Lists are not a good data structure for searching through. What you should do instead, for a use case like this, is transform them into sets first - unordered collections (order probably doesn't matter here) that use a hash table to determine whether an element is present. Instead of scanning the whole data structure element by element, a set just hashes the element being searched for, checks whether anything is stored in that slot, and reports the result.

    In other words, element in set takes roughly constant time on average, whereas element in list takes time proportional to the list's length. For a relatively small one-time overhead (building the sets), this shaves a huge amount of time off the loops that follow:

    old_set = set(old)
    new_set = set(new)
    added = [i for i in new if i not in old_set]
    removed = [i for i in old if i not in new_set]
    
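    One wrinkle for the data in this question: the rows are dicts, and dicts aren't hashable, so set(old) would raise TypeError: unhashable type: 'dict'. A sketch of a workaround (with made-up field values) is to freeze each row into a tuple of its items first:

```python
# Illustrative rows; the real data has the CSV's full set of fields.
old = [{'Name': 'Alice', 'Registration Number': 'R001'},
       {'Name': 'Bob', 'Registration Number': 'R002'}]
new = [{'Name': 'Bob', 'Registration Number': 'R002'},
       {'Name': 'Carol', 'Registration Number': 'R003'}]

# Dicts can't go into a set, but tuples of (key, value) pairs can.
old_set = {tuple(d.items()) for d in old}
new_set = {tuple(d.items()) for d in new}

added = [d for d in new if tuple(d.items()) not in old_set]
removed = [d for d in old if tuple(d.items()) not in new_set]
```

    Since rows read from the same CSV share the same key order, tuple(d.items()) produces comparable tuples; if the key order could ever differ, use tuple(sorted(d.items())) instead.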

    Furthermore, you can dispense with the list comprehensions entirely, because set supports operations from set theory - taking the difference of two sets (elements in one set that are not in the other) is as easy as subtracting them:

    added = list(new_set - old_set)  # (new_set - old_set) is identical to new_set.difference(old_set)
    removed = list(old_set - new_set)
    

    which is probably even more efficient than the list comprehensions, because set difference is optimized for exactly this use case. (One caveat: sets discard duplicate rows and don't preserve the file's original order.)
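    For instance, if each row were reduced to a single hashable key, such as its Registration Number (the values here are made up), the whole daily diff becomes two subtractions:

```python
old_set = {'R001', 'R002', 'R003'}
new_set = {'R002', 'R003', 'R004'}

added = list(new_set - old_set)    # keys present only in the new file
removed = list(old_set - new_set)  # keys present only in the old file
```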