python pandas pipeline python-itertools python-class

Incremental ID in class returns incorrect ID when multiple instances of the class are used in a pipeline


I am trying to add an ID attribute to a class that increments for each new instance. Each instance is then passed into a pipeline, which produces some unexpected results.

A reproducible example looks like the below.

Setting up the classes:

import itertools
import pandas as pd

class Parent:
    id_num = itertools.count()
    def __init__(self):
        ...

class Daughter(Parent):
    def __init__(self):
        self.ID = next(Parent.id_num)
    
    def add_df(self, df):
        self.df = df
        self.df["ID_Num"] = self.ID


class DF_adder:
    def __init__(self, d1, d2, df):
        self.d1 = d1
        self.d2 = d2
        self.df = df
    
    def add_df(self):
        for daughter in [self.d1, self.d2]:
            print(f"Adding df to {daughter.ID}")
            daughter.add_df(self.df)

df = pd.DataFrame({"a":[1,2,3], "b":[4,5,6]})

Using the classes:

d1 = Daughter()
d2 = Daughter()

Examining the classes:

print(d1.ID)
$ 0
print(d2.ID)
$ 1

Using the Pipeline:

a = DF_adder(d1, d2, df)
a.add_df()

And here is the issue: when viewing the dataframes, the ID_Num column is 1 in both, even though neither instance's ID attribute has changed.
d1.df gives

   a  b  ID_Num
0  1  4       1
1  2  5       1
2  3  6       1

but if I check d1.ID again it still outputs 0. d2.df gives an identical table output and d2.ID gives 1, as expected.

What is causing this behaviour?

I can see that the ID variable is correct in the Daughter object, so I am not sure why it is using the highest ID in all cases in the pipeline. The value is also set correctly if set outside of the pipeline/DF_adder object, so I'm imagining the issue is the scope within the Pipeline class.


Solution

  • Your code passes the same DataFrame object to both daughters. The assignment self.df = df stores a reference, not a copy, so when daughter.add_df(self.df) runs self.df["ID_Num"] = self.ID for d2, it overwrites the ID_Num column of the one DataFrame that d1.df also refers to. That is why d1.ID is still 0 while d1.df shows 1.

    A simple fix is to copy the DataFrame inside the Daughter class, so that each daughter stores its own version and can modify it without side effects.
    Just do self.df = df.copy() in your Daughter.add_df method and you'll get the desired behavior.
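As a minimal sketch of that fix, the snippet below reuses the names from the question but makes two simplifying assumptions: the Parent base class is dropped and the counter lives directly on Daughter, and the DF_adder wrapper is replaced by a plain loop. With the original self.df = df, the check d1.df is d2.df would be True, because both attributes are bound to the same object.

```python
import itertools
import pandas as pd


class Daughter:
    # Class-level counter shared by every instance (simplified: no Parent class).
    id_num = itertools.count()

    def __init__(self):
        self.ID = next(Daughter.id_num)

    def add_df(self, df):
        # Copy first so that each instance owns an independent DataFrame.
        self.df = df.copy()
        self.df["ID_Num"] = self.ID


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
d1, d2 = Daughter(), Daughter()

# Stand-in for DF_adder.add_df: hand the same source frame to both instances.
for daughter in (d1, d2):
    daughter.add_df(df)

print(d1.df is d2.df)            # False
print(d1.df["ID_Num"].tolist())  # [0, 0, 0]
print(d2.df["ID_Num"].tolist())  # [1, 1, 1]
print("ID_Num" in df.columns)    # False
```

Copying inside add_df, rather than at the call site, means callers cannot forget to do it. It also leaves the original df untouched, which the last line checks.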