python pandas dataframe

Calculate difference between rows in Pandas dataframe using conditional logic


I am trying to use pandas.DataFrame.diff to calculate the difference between consecutive rows in a dataframe. The catch is that I only want the difference for certain rows, based on simple conditional logic applied to the 'data' column: if the 'data' value is 0, the difference should be set to 0 (NaN would also be acceptable).

The issue is similar to "How do I create a new column where the values are selected based on an existing column?", but I have not been able to apply that solution here.

Sample code is provided below. The table shows the current results, with my desired result in the last column.

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({'data': [10, 20, 0, 25, 0, 0, 30, 25, 40]})

df['diff'] = df.data.diff()

Table showing the current results, with the desired result in the last column

data | diff | desired_result
  10 |  NaN |  NaN
  20 |   10 |   10
   0 |  -20 |    0
  25 |   25 |   25
   0 |  -25 |    0
   0 |    0 |    0
  30 |   30 |   30
  25 |   -5 |   -5
  40 |   15 |   15

I tried the following code with conditional logic, but it does not work:

df['diff'] = df['data'].apply(lambda x: 0 if x == 0 else df.data.diff())

Solution

  • You can chain Series.mask to set the difference to 0 wherever 'data' is 0:

    df['diff'] = df.data.diff().mask(df['data'].eq(0), 0)
    
    print (df)
       data  diff
    0    10   NaN
    1    20  10.0
    2     0   0.0
    3    25  25.0
    4     0   0.0
    5     0   0.0
    6    30  30.0
    7    25  -5.0
    8    40  15.0
    

    Or use numpy.where:

    import numpy as np

    df['diff'] = np.where(df['data'].eq(0), 0, df.data.diff())
    
    print (df)
       data  diff
    0    10   NaN
    1    20  10.0
    2     0   0.0
    3    25  25.0
    4     0   0.0
    5     0   0.0
    6    30  30.0
    7    25  -5.0
    8    40  15.0
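
The question also mentions that NaN would be acceptable in place of 0. A minimal sketch of that variant, assuming the same sample data: calling Series.mask without a replacement value leaves NaN in the masked rows. The column names diff_nan and diff_zero are only illustrative and do not appear in the original post.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({'data': [10, 20, 0, 25, 0, 0, 30, 25, 40]})

# Boolean Series marking the rows whose 'data' value is 0.
is_zero = df['data'].eq(0)

# mask() with no replacement value puts NaN in the masked rows.
df['diff_nan'] = df['data'].diff().mask(is_zero)

# Same condition, filling with 0 instead, as in the answer above.
df['diff_zero'] = df['data'].diff().mask(is_zero, 0)

print(df)
```

The first row is NaN in both columns because diff has no previous row to subtract from. The masking condition does not change that.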