I am currently working with a few DataFrames and want to make my code modular, which means passing DataFrames to functions. I am aware that DataFrames are mutable, and of some of the 'gotchas' when passing mutable objects to functions. Is there a best practice for passing DataFrames to functions? Should I make a copy within the function and return it? Or should I just modify df in place within the function and return None?
Is option 1 or 2 better? Below is basic code to convey the idea:
Option 1:
def test(df):
    df['col1'] = df['col1'] + 1
    return None
test(df)
Option 2:
def test(main_df):
    df = main_df.copy()
    df['col1'] = df['col1'] + 1
    return df
main_df = test(main_df)
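To make the difference between the two options concrete, here is a minimal, self-contained sketch. The function names add_one_inplace and add_one_copy and the sample data are made up for illustration; they are not part of the code above.

```python
import pandas as pd

def add_one_inplace(df):
    # Option 1 style: modifies the caller's DataFrame and returns nothing.
    df['col1'] = df['col1'] + 1

def add_one_copy(main_df):
    # Option 2 style: works on a copy, so the caller's DataFrame is untouched.
    df = main_df.copy()
    df['col1'] = df['col1'] + 1
    return df

# Hypothetical sample data, for illustration only.
original = pd.DataFrame({'col1': [1, 2, 3]})

# Option 2: the result is a new object and the original keeps its values.
result = add_one_copy(original)
copy_left_original_alone = original['col1'].tolist() == [1, 2, 3]

# Option 1: the same object the caller holds is changed.
add_one_inplace(original)
inplace_changed_original = original['col1'].tolist() == [2, 3, 4]
```

With option 1 the caller has to know that the function changes its argument; with option 2 the change is only visible through the returned value.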
I use DataFrame.pipe a lot to organize my code, so I'm going to say option 2. pipe passes the DataFrame to your function and returns whatever that function returns, so when each step takes and returns a DataFrame you can chain multiple steps together.
def step1(main_df):
    # Work on a copy so the caller's DataFrame is left unchanged
    df = main_df.copy()
    df['col1'] = df['col1'] + 1
    return df

def step2(main_df):
    df = main_df.copy()
    df['col1'] = df['col1'] + 1
    return df

def step3(main_df):
    df = main_df.copy()
    df['col1'] = df['col1'] + 1
    return df

# Each step receives the DataFrame returned by the previous one
main_df = (main_df.pipe(step1)
                  .pipe(step2)
                  .pipe(step3)
          )
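Since the three steps above all do the same thing, one way to cut the repetition is to let pipe forward extra arguments to a single function. This is only a sketch: add_to_column, its column and amount parameters, and the sample data are hypothetical names introduced here for illustration.

```python
import pandas as pd

def add_to_column(main_df, column, amount):
    # Hypothetical helper: copy first so the input DataFrame is not modified.
    df = main_df.copy()
    df[column] = df[column] + amount
    return df

# Sample data, for illustration only.
main_df = pd.DataFrame({'col1': [1, 2, 3]})

# pipe passes the DataFrame as the first argument and forwards the
# remaining positional and keyword arguments to the function.
piped = (main_df.pipe(add_to_column, 'col1', 1)
                .pipe(add_to_column, 'col1', amount=10)
        )
```

Because every step copies its input, main_df still holds its original values after the chain runs, and piped holds the transformed result.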