Tags: python, pandas, dataframe, pandas-apply

pandas df.apply unexpectedly changes dataframe inplace


From my understanding, pandas.DataFrame.apply does not apply changes inplace and we should use its return object to persist any changes. However, I've found the following inconsistent behavior:

Let's apply a dummy function to check that the original df remains untouched:

>>> import pandas as pd
>>> def foo(row: pd.Series):
...     row['b'] = '42'

>>> df = pd.DataFrame([('a0','b0'),('a1','b1')], columns=['a', 'b'])
>>> df.apply(foo, axis=1)
>>> df
    a   b
0   a0  b0
1   a1  b1

This behaves as expected. However, foo will apply the changes inplace if we modify the way we initialize this df:

>>> df2 = pd.DataFrame(columns=['a', 'b'])
>>> df2['a'] = ['a0','a1']
>>> df2['b'] = ['b0','b1']
>>> df2.apply(foo, axis=1)
>>> df2
    a   b
0   a0  42
1   a1  42

I've also noticed that this does not happen if the column dtypes are not 'object'. Why does apply() behave differently in these two contexts?

Python: 3.6.5

Pandas: 0.23.1


Solution

  • Interesting question! I believe the behavior you're seeing is an artifact of the way you use apply.

    As you correctly indicate, apply is not intended to be used to modify a dataframe. However, since apply takes an arbitrary function, it cannot guarantee that the function is free of side effects and leaves the dataframe unchanged. Here, you've found a great example of that behavior, because your function foo attempts to modify the row that apply passes to it.

    Using apply to modify a row can lead to side effects like this one, so it is best avoided.
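
    To make the point concrete, here is a minimal sketch (not from the original answer) of a variant of foo that works on a copy of the row and returns it. The name foo_pure is made up for illustration. Because only the copy is modified, df2 should stay untouched however it was built, and the changed values come back through apply's return value:

```python
import pandas as pd

def foo_pure(row: pd.Series) -> pd.Series:
    # Hypothetical helper: copy first so the object pandas passed in is never mutated.
    row = row.copy()
    row['b'] = '42'
    return row

# Build df2 the same way as in the question.
df2 = pd.DataFrame(columns=['a', 'b'])
df2['a'] = ['a0', 'a1']
df2['b'] = ['b0', 'b1']

# apply collects the returned rows into a new DataFrame; df2 itself is left alone.
result = df2.apply(foo_pure, axis=1)
```

    If the new values should replace the old ones, assign the result back explicitly, for example df2 = result.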

    A more idiomatic use of apply is to build a new column from the function's return value. Here's an example of how apply is typically used, which I believe would steer you away from this potentially troublesome area:

    import pandas as pd
    # construct df2 just like you did
    df2 = pd.DataFrame(columns=['a', 'b'])
    df2['a'] = ['a0','a1']
    df2['b'] = ['b0','b1']

    df2['b_copy'] = df2.apply(lambda row: row['b'], axis=1) # apply to each row
    df2['b_replace'] = df2.apply(lambda row: '42', axis=1) # constant value for each row
    df2['b_reverse'] = df2['b'].apply(lambda val: val[::-1]) # apply to each value in b column

    print(df2)

    # output:
    #     a   b b_copy b_replace b_reverse
    # 0  a0  b0     b0        42        0b
    # 1  a1  b1     b1        42        1b
    

    Notice that pandas passes a row or a cell to the function you give as the first argument to apply, then stores the function's output in a column of your choice.

    If you'd like to modify a dataframe row-by-row, take a look at iterrows and loc for the most idiomatic route.
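
    As a rough sketch of that route (the condition and the sample data are only illustrative assumptions), the loop below reads each row from iterrows and writes back to the frame through loc, rather than assigning to the row object itself:

```python
import pandas as pd

df = pd.DataFrame({'a': ['a0', 'a1'], 'b': ['b0', 'b1']})

# iterrows yields (index label, row) pairs; treat row as read-only
# and write to the frame explicitly through loc.
for idx, row in df.iterrows():
    if row['a'] == 'a1':
        df.loc[idx, 'b'] = '42'
```

    Writing through df.loc makes the modification explicit, so it does not depend on whether the row object happens to be a view or a copy of the underlying data.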