How to compare large lists of dataclass objects fast in Python


I have two data sources that I must align with each other. One is the master, and the other must mirror the master's state; I will call it the client from now on.

I import the data from both master and client into my Python code and put it into dataclass objects. I then have one list of dataclass objects representing the master data and another representing the client data.

Then I compare the two lists. In the end I want a list of dataclass objects containing the differences between them. Entries missing from the client list must be added to the client; surplus entries must be deleted.

My code looks like this:


import copy

from tqdm import tqdm


def compare_data(csv_data_expected, csv_data_actual):
    # Work on copies so the input lists stay untouched.
    csv_data_to_add = copy.deepcopy(csv_data_expected)
    csv_data_to_delete = copy.deepcopy(csv_data_actual)

    for entry in tqdm(csv_data_expected):
        # Entries present on both sides need no action.
        if entry in csv_data_actual:
            csv_data_to_add.remove(entry)
            csv_data_to_delete.remove(entry)

    # Mark what is left with the action to take.
    for x in csv_data_to_add:
        setattr(x, "add", "ADD")
    for x in csv_data_to_delete:
        setattr(x, "delete", "DEL")

    return csv_data_to_delete + csv_data_to_add

The principle is as follows: I make a copy of both the expected data (master) and the actual data (client). Then I loop over one of the lists and remove every entry that appears in both. These entries are present in both systems and are of no interest to me. After the loop, the to_add list contains only entries missing from the actual data, and the to_delete list contains only entries missing from the expected data.

This approach worked well when I used dictionaries. Now that I have switched to dataclasses, the code suddenly takes unbearably long to execute. Comparing 40,000 entries of a dataclass with 3 attributes took half an hour on my office machine.

Is there any way to speed things up while keeping the dataclass approach?


Solution

  • Make your dataclass hashable, either by setting frozen=True while leaving eq at its default of True (see https://docs.python.org/3/library/dataclasses.html#dataclasses.dataclass) or by implementing __hash__ yourself. You can then use sets instead of lists, which makes membership checks, unions, differences etc. much faster and eliminates your loop entirely:

    csv_data_to_add = set(csv_data_expected) - set(csv_data_actual)
    csv_data_to_delete = set(csv_data_actual) - set(csv_data_expected)
    

    Other remarks:

    • if entry in csv_data_actual and entry should be flipped to if entry and entry in csv_data_actual, because the truthiness check on entry is much cheaper than the membership test entry in csv_data_actual. Python does not evaluate the right-hand side of and when the left-hand side is already falsy.
    • if entry in csv_data_to_add: csv_data_to_add.remove(entry) first does a membership check, and then remove() has to search for the same entry again. Use try: csv_data_to_add.remove(entry), except ValueError: pass instead, which searches the list only once.
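
To make the set-based suggestion from the first point concrete, here is a minimal sketch. The Row class and its field names are made up for illustration, since the question does not show the actual dataclass. Two things differ from the original function and are assumptions on my part: a frozen dataclass rejects setattr, so the sketch pairs each entry with its action instead of attaching an "add" or "delete" attribute, and sets drop duplicate entries, which only matters if the same row can legitimately appear more than once in a list.

```python
from dataclasses import dataclass


# Hypothetical record type; the field names are assumptions for illustration.
@dataclass(frozen=True)  # frozen=True with the default eq=True generates __hash__
class Row:
    key: str
    name: str
    value: int


def compare_data(expected, actual):
    """Return (action, row) pairs that bring `actual` in line with `expected`."""
    expected_set = set(expected)
    actual_set = set(actual)

    to_add = expected_set - actual_set      # in the master, missing on the client
    to_delete = actual_set - expected_set   # on the client, not in the master

    # Frozen instances reject setattr, so the action is paired with the row
    # instead of being stored as a new attribute on it.
    return [("DEL", row) for row in to_delete] + [("ADD", row) for row in to_add]


master = [Row("a", "alpha", 1), Row("b", "beta", 2), Row("c", "gamma", 3)]
client = [Row("b", "beta", 2), Row("c", "gamma", 30), Row("d", "delta", 4)]

diff = compare_data(master, client)
```

Building the two sets and taking the differences costs roughly time proportional to the number of entries, whereas the list version scans a list for every entry. If the dataclass has to stay mutable, implementing __hash__ yourself is the alternative mentioned above, with the usual caveat that fields taking part in the hash should not change while the object sits in a set.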