Tags: python, pandas, wrapper, shallow-copy

Pandas dataframe shallow copy not reacting to data changes?


I have a wrapper class that holds a specific dataframe, plus some modifier functions/callables that operate on it.

import re
from typing import Callable

import pandas as pd


class PhoneNumberCleaner:
    def __init__(self, data: pd.DataFrame, pattern: str):
        self.data = data  # shallow copy?
        self.pattern = pattern

    def __call__(self, *args, **kwargs) -> pd.DataFrame:
        drop_mask = self.data['phoneNumber'].apply(
            lambda pn: not re.fullmatch(self.pattern, pn)
        )
        drop_mask_index = drop_mask[drop_mask].index
        return self.data.drop(drop_mask_index)


class Wrapper:
    def __init__(self, data: pd.DataFrame):
        self.data = data

    def modify(self, modifier: Callable, *args, **kwargs):
        self.data = modifier(*args, **kwargs)

Now, let's say I have the following data:

df_data = {
    'name': ['Mickey', 'Anna', 'Todd', 'Lee', 'Amanda', 'Jake'],
    'phoneNumber': [
        '0321111444---',
        '0335555666',
        '0330001234',
        '0330123456789',
        '0328888999',
        '0999999999999',
    ]
}
df = pd.DataFrame(df_data)

and I want to drop the rows where the phone number doesn't match the expected pattern:

wrapper = Wrapper(df)
number_cleaner = PhoneNumberCleaner(wrapper.data, r'\d{10}')
wrapper.modify(number_cleaner)

Printing wrapper data works fine:

print(wrapper.data)

     name phoneNumber
1    Anna  0335555666
2    Todd  0330001234
4  Amanda  0328888999

However, when I access the same data through the PhoneNumberCleaner object (which is supposed to refer to the same dataframe), I get the old data:

print(number_cleaner.data)

     name    phoneNumber
0  Mickey  0321111444---
1    Anna     0335555666
2    Todd     0330001234
3     Lee  0330123456789
4  Amanda     0328888999
5    Jake  0999999999999

I tried adding .copy(deep=False) when assigning data in the Wrapper and PhoneNumberCleaner classes, but that didn't help. What am I missing here?


Solution

  • The problem is the return statement in __call__:

    class PhoneNumberCleaner:
        def __call__(self, *args, **kwargs) -> pd.DataFrame:
            ...
            return self.data.drop(drop_mask_index)
    

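    DataFrame.drop returns a new dataframe by default; the original dataframe (self.data) is not modified. Wrapper.modify then rebinds wrapper.data to that new object, while number_cleaner.data still refers to the original one. .copy(deep=False) doesn't help either, because it also creates a new DataFrame object.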

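    Change it to drop in place, so that every reference to the dataframe sees the result (note that this also modifies the original df):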

    class PhoneNumberCleaner:
        def __call__(self, *args, **kwargs) -> pd.DataFrame:
            ...
            self.data.drop(drop_mask_index, inplace=True)
            return self.data
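
The difference between the two forms of drop can be checked with object identity. The following is a minimal sketch, not the asker's full code: it uses a shortened version of the data, a hypothetical name alias standing in for number_cleaner.data, and Series.str.fullmatch (available in pandas 1.1 and later) in place of the apply/re.fullmatch combination.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Mickey', 'Anna', 'Todd'],
    'phoneNumber': ['0321111444---', '0335555666', '0330001234'],
})
# A second name bound to the same DataFrame object (hypothetical stand-in
# for number_cleaner.data in the question).
alias = df

# Index labels of rows whose phone number does not fully match the pattern.
bad_index = df.index[~df['phoneNumber'].str.fullmatch(r'\d{10}')]

# Without inplace, drop() builds a new DataFrame; df and alias are untouched.
dropped = df.drop(bad_index)
print(dropped is df)  # False
print(len(alias))     # 3

# With inplace=True, the shared object itself is mutated, so alias sees it.
df.drop(bad_index, inplace=True)
print(alias is df)    # True
print(len(alias))     # 2
```

If mutating the caller's dataframe is not wanted, it is probably cleaner to keep the non-inplace drop and read the current data from the wrapper rather than from the cleaner.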