python pandas for-loop in-place

Changes to a pandas DataFrame in a for loop are only partially saved


I have two DataFrames and want to modify both of them in a for loop.

I have found that creating a new column inside the loop updates the DataFrame, but other operations such as set_index or dropping columns do not.

import pandas as pd
import numpy as np

gen1 = pd.DataFrame(np.random.rand(12,3))
gen2 = pd.DataFrame(np.random.rand(12,3))

df1 = pd.DataFrame(gen1)
df2 = pd.DataFrame(gen2)


all_df = [df1, df2]

for x in all_df:
    x['test'] = x[1]+1
    x = x.set_index(0).drop(2, axis=1)
    print(x)

Note that when each df is printed inside the loop, all of the commands appear to have been applied to both dfs. But when I inspect either df afterwards, only the new column 'test' is there; the effects of set_index and drop are gone.

Am I missing something as to why only one of the commands has been made permanent? Thank you.


Solution

  • Here's what's going on:

    x is a variable that, at the start of each iteration of your for loop, refers to an element of the list all_df. When you assign to x['test'], you are using x to modify that element in place, so it does what you want.

    However, when you assign something new to x, you simply make x refer to that new thing (here, the new DataFrame returned by set_index(...).drop(...)) without touching the object x previously referred to (namely, the element of all_df that you are hoping to change).
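To make the difference between modifying an object and rebinding a name concrete, here is a minimal sketch. The small hand-written DataFrame and the names frames, same_object_before and same_object_after are made up for illustration and are not part of the question's code.

```python
import pandas as pd

# Small illustrative frame with the same integer column labels as in the question.
df = pd.DataFrame({0: [1.0, 2.0], 1: [3.0, 4.0], 2: [5.0, 6.0]})
frames = [df]

for x in frames:
    same_object_before = x is df         # x names the very object stored in the list
    x['test'] = x[1] + 1                 # modifies that shared object in place
    x = x.set_index(0).drop(2, axis=1)   # builds a new DataFrame and rebinds x to it
    same_object_after = x is df          # x now names a different object

print(same_object_before, same_object_after)
print(list(df.columns))   # the original kept 'test' but was not re-indexed or trimmed
print(list(x.columns))    # only the rebound x reflects set_index and drop
```

The `is` checks compare object identity, which is what decides whether a change made through x is visible through df afterwards.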

    You could try something like this instead:

    for x in all_df:
        x['test'] = x[1]+1
        x.set_index(0, inplace=True)
        x.drop(2, axis=1, inplace=True)
    
    print(df1)
    print(df2)
    

    Please note that using inplace is often discouraged, so you may want to consider whether you can achieve your objective by creating new DataFrame objects based on df1 and df2 instead.
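One possible way to avoid inplace, sketched under the assumption that it is acceptable to rebind df1 and df2 to new objects: put the steps in a helper that returns a new DataFrame, then reassign the results. The helper name transform is hypothetical and not from the original post.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.rand(12, 3))
df2 = pd.DataFrame(np.random.rand(12, 3))

def transform(df):
    """Return a new DataFrame with the three steps applied; the input is not modified."""
    return (
        df.assign(test=df[1] + 1)   # add the 'test' column on a copy
          .set_index(0)             # use column 0 as the index
          .drop(columns=2)          # remove column 2
    )

# Rebind the names to the transformed results.
df1, df2 = [transform(df) for df in (df1, df2)]

print(df1.head())
print(df2.head())
```

Because each step returns a new object, nothing depends on which operations happen to modify their input, and the original frames stay available if you keep a reference to them.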