I am trying to use pandas.DataFrame.diff to calculate the difference between consecutive rows in a dataframe. The catch is that I only want the difference for certain rows, based on simple conditional logic applied to the 'data' column: if the 'data' value is 0, the difference should be set to 0. (It doesn't have to be 0; NaN would also be fine.)
The issue is similar to "How do I create a new column where the values are selected based on an existing column?". However, I have not been able to apply that solution here.
Sample code is provided below. The table shows the current result and, in the last column, my desired result.
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({'data': [10, 20, 0, 25, 0, 0, 30, 25, 40]})
df['diff'] = df.data.diff()
Table showing the current result and the desired result in the last column:
value | diff | desired_result
10 | NaN | NaN
20 | 10 | 10
0 | -20 | 0
25 | 25 | 25
0 | -25 | 0
0 | 0 | 0
30 | 30 | 30
25 | -5 | -5
40 | 15 | 15
I have tried the following code with conditional logic, but it does not work:
df['diff'] = df['data'].apply(lambda x: 0 if x == 0 else df.data.diff())
You can use Series.mask to set the difference to 0 wherever 'data' is 0:
df['diff'] = df.data.diff().mask(df['data'].eq(0), 0)
print (df)
data diff
0 10 NaN
1 20 10.0
2 0 0.0
3 25 25.0
4 0 0.0
5 0 0.0
6 30 30.0
7 25 -5.0
8 40 15.0
Or use numpy.where:
import numpy as np
df['diff'] = np.where(df['data'].eq(0), 0, df.data.diff())
print (df)
data diff
0 10 NaN
1 20 10.0
2 0 0.0
3 25 25.0
4 0 0.0
5 0 0.0
6 30 30.0
7 25 -5.0
8 40 15.0
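Since the question mentions that NaN would also be acceptable, here is a minimal self-contained sketch that puts both approaches side by side and adds the NaN variant. The column names diff_mask, diff_where and diff_nan, and the helper variables is_zero and raw_diff, are made up for illustration and are not part of the original code.

```python
import numpy as np
import pandas as pd

# Sample DataFrame from the question
df = pd.DataFrame({'data': [10, 20, 0, 25, 0, 0, 30, 25, 40]})

is_zero = df['data'].eq(0)    # True on rows where 'data' is 0
raw_diff = df['data'].diff()  # plain row-to-row difference

# Series.mask replaces values where the condition is True
df['diff_mask'] = raw_diff.mask(is_zero, 0)

# numpy.where picks 0 where the condition is True, otherwise the raw difference
df['diff_where'] = np.where(is_zero, 0, raw_diff)

# Calling mask without a replacement value leaves NaN on the masked rows
df['diff_nan'] = raw_diff.mask(is_zero)

print(df)
```

The first two columns should hold the same values; the third differs only in that rows where 'data' is 0 hold NaN instead of 0.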