Tags: pandas, dataframe, deep-copy

How to make sure a pandas DataFrame is not changed after being passed into a function


I am working with a not-so-small df (1.7GB+, containing Python objects) on which I have to run a number of calculations and return a list of strings.

However, as mentioned in the documentation of DataFrame.copy, the deep copy is not recursive: the data is copied, but Python objects stored in the df are only copied by reference. This means the Python objects in my df can potentially be changed in the function.
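To make the concern concrete, here is a minimal sketch with a hypothetical toy frame (the column name `tags` and its contents are made up for illustration). It shows that a list stored in an object column is still shared between the original and a deep copy:

```python
import pandas as pd

# Hypothetical toy frame: one object column holding mutable Python lists.
df = pd.DataFrame({"tags": [["a"], ["b"]]})
df_deep = df.copy(deep=True)

# The copy gets its own array of references, but each reference
# still points at the same list object as the original.
same_object = df_deep.loc[0, "tags"] is df.loc[0, "tags"]

# Mutating the list through the copy is therefore visible in the original.
df_deep.loc[0, "tags"].append("changed")
original_value = df.loc[0, "tags"]
```

Here `same_object` is `True`, and `original_value` contains the appended element even though only the copy was touched.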

The problem is that I have to call the function many times, and given the size of the df, deep copying each column on every call is not an option.

Are there any tips, tricks, testing methods, or anything else that can help?

EDIT: After some testing with plain reassignment (df_copy=df rather than df=df.copy()), I found that applying functions such as groupby or explode to the df does not change the original df, while others such as iloc, loc, or sort_values do change it.

What causes that?
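
A small sketch of the kind of test described above, on a hypothetical toy frame (column names `key` and `val` are made up). It assumes the changes came from assigning through `loc` or from passing `inplace=True`. With default arguments, `sort_values` returns a new object, just like `groupby`:

```python
import pandas as pd

df = pd.DataFrame({"key": ["b", "a", "b"], "val": [3, 1, 2]})
df_copy = df  # plain assignment: a second name for the same object

is_alias = df_copy is df

# By default, sort_values and groupby return new objects
# and leave the shared frame as it was.
sorted_df = df_copy.sort_values("val")
totals = df_copy.groupby("key")["val"].sum()
order_after_sort = df["val"].tolist()

# Assigning through loc (or iloc) writes into the shared object...
df_copy.loc[0, "val"] = 99
value_after_loc = df.loc[0, "val"]

# ...and so does any method called with inplace=True.
df_copy.sort_values("val", inplace=True)
order_after_inplace = df["val"].tolist()
```

Since `df_copy` and `df` are the same object, anything that writes in place through one name is visible through the other.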


Solution

  • I had the same question when I first started learning Python.

    There are some suggestions I can give:

    1. Keep the original data untouched, and if you need to use everything in the DataFrame, try to make only one copy of the original data;

    2. Use the df.loc[] indexer to select only the data you need, rather than copying the whole DataFrame;

    3. Try to limit the use of df['column_name'] if you need to modify the data later on, as modifying the selected column in place can change the original DataFrame.
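
    The suggestions above can be sketched roughly as follows. This is only an illustration under assumptions: the function name `summarize`, the column names, and the filter condition are all hypothetical. The idea is to select the needed rows and columns with `df.loc[]` and copy only that small slice before modifying it. Note that any Python objects inside the copied columns would still be shared with the original.

    ```python
    import pandas as pd

    def summarize(df):
        # Hypothetical function: select only the needed rows and columns,
        # then copy that small slice instead of the whole DataFrame.
        subset = df.loc[df["val"] > 1, ["key", "val"]].copy()
        subset["val"] = subset["val"] * 10  # modifies the slice only
        return [f"{k}:{v}" for k, v in zip(subset["key"], subset["val"])]

    # Toy data; the "payload" column stands in for large Python objects
    # that the function never selects and therefore never copies.
    df = pd.DataFrame(
        {"key": ["a", "b", "c"], "val": [1, 2, 3], "payload": [[1], [2], [3]]}
    )
    result = summarize(df)
    values_after_call = df["val"].tolist()
    ```

    After the call, `values_after_call` still holds the original values, and the `payload` column was never touched or copied.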

    I am still learning myself, so these are definitely not complete answers. I will write some examples for you to play around with in Jupyter Notebook if I get some more time in the future.

    I have created a GitHub repository to store sample code for solutions, feel free to browse and play around.

    My StackOverflow repository on GitHub