From my understanding, pandas.DataFrame.apply does not apply changes in place, and we should use its return value to persist any changes. However, I've found the following inconsistent behavior:
Let's apply a dummy function to check that the original df remains untouched:
>>> import pandas as pd
>>> def foo(row: pd.Series):
...     row['b'] = '42'
...
>>> df = pd.DataFrame([('a0','b0'),('a1','b1')], columns=['a', 'b'])
>>> df.apply(foo, axis=1)
>>> df
    a   b
0  a0  b0
1  a1  b1
This behaves as expected. However, foo will apply the changes in place if we change the way we initialize the df:
>>> df2 = pd.DataFrame(columns=['a', 'b'])
>>> df2['a'] = ['a0','a1']
>>> df2['b'] = ['b0','b1']
>>> df2.apply(foo, axis=1)
>>> df2
    a   b
0  a0  42
1  a1  42
I've also noticed that this does not happen if the column dtypes are not 'object'. Why does apply() behave differently in these two contexts?
Python: 3.6.5
Pandas: 0.23.1
Interesting question! I believe the behavior you're seeing is an artifact of the way you use apply.
As you correctly indicate, apply is not intended to be used to modify a dataframe. However, since apply takes an arbitrary function, it can't guarantee that the function is free of side effects and leaves the dataframe unchanged. Here, you've found a great example of that behavior, because your function foo attempts to modify the row that it is passed by apply.
Using apply to modify a row can lead to side effects like these, so it isn't best practice.
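If the function really does need to change values in the row, one way to avoid touching the original is to work on a copy and return it. Here is a minimal sketch, assuming a hypothetical helper named foo_pure (not part of the question) and the same df2 construction as above:

```python
import pandas as pd

def foo_pure(row: pd.Series) -> pd.Series:
    # Copy first so the object handed over by apply is never mutated.
    row = row.copy()
    row['b'] = '42'
    # apply assembles the returned rows into a new DataFrame.
    return row

# Build df2 column by column, as in the question.
df2 = pd.DataFrame(columns=['a', 'b'])
df2['a'] = ['a0', 'a1']
df2['b'] = ['b0', 'b1']

result = df2.apply(foo_pure, axis=1)

print(result)  # new frame with the modified 'b' column
print(df2)     # original frame, left as it was
```

The changes live only in result, so whether apply hands the function a view or a copy of the underlying data no longer matters.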
A more idiomatic use of apply is to create a new column. Here's an example of how apply is typically used, which I believe would steer you away from this potentially troublesome area:
import pandas as pd
# construct df2 column by column, as in the question
df2 = pd.DataFrame(columns=['a', 'b'])
df2['a'] = ['a0','b0']
df2['b'] = ['a1','b1']
df2['b_copy'] = df2.apply(lambda row: row['b'], axis=1) # apply to each row
df2['b_replace'] = df2.apply(lambda row: '42', axis=1)
df2['b_reverse'] = df2['b'].apply(lambda val: val[::-1]) # apply to each value in b column
print(df2)
# output:
#     a   b b_copy b_replace b_reverse
# 0  a0  a1     a1        42        1a
# 1  b0  b1     b1        42        1b
Notice that pandas passes a row or a cell to the function you give as the first argument to apply; you then store the function's output in a column of your choice.
If you'd like to modify a dataframe row-by-row, take a look at iterrows and loc for the most idiomatic route.
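As a rough sketch of that route (the sample data and the condition on column a are assumptions made up for illustration), the loop reads each row from iterrows and writes back through loc on the dataframe itself, not on the row object:

```python
import pandas as pd

df = pd.DataFrame({'a': ['a0', 'a1'], 'b': ['b0', 'b1']})

for idx, row in df.iterrows():
    # Treat `row` as read-only; write through df.loc so the change
    # is applied to the dataframe explicitly.
    if row['a'] == 'a1':
        df.loc[idx, 'b'] = '42'

print(df)
```

Writing through df.loc makes the modification explicit, so it does not depend on whether the row yielded by iterrows is a view or a copy.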