Search code examples
pythonpandasdataframepandas-groupbycumsum

Subtract value in row based on condition in pandas


I need to subtract dates based on the progression of fault count. Below is the table that has the two input columns Date and Fault_Count. The output columns I need are Option1 and Option2. The last two columns show the date difference calculations. Basically when the Fault_Count changes I need to count the number of days from when the Fault_Count changed to the initial start of fault count. For example the Fault_Count changed to 2 on 1/4/2020, I need to get the number of days from when the Fault_Count started at 0 and changed to 2 (i.e. 1/4/2020 - 1/1/2020 = 3).

Date    Fault_Count Option1 Option2    Option1calc          Option2calc
1/1/2020       0       0        0       
1/2/2020       0       0        0       
1/3/2020       0       0        0       
1/4/2020       2       3        3   1/4/2020-1/1/2020    1/4/2020-1/1/2020
1/5/2020       2       0        0       
1/6/2020       2       0        0       
1/7/2020       4       3        3   1/7/2020-1/4/2020    1/7/2020-1/4/2020
1/8/2020       4       0        0       
1/9/2020       5       2        2   1/9/2020-1/7/2020    1/9/2020-1/7/2020
1/10/2020      5       0        0       
1/11/2020      0       2       -2   1/11/2020-1/9/2020   (1/11/2020-1/9/2020)*-1 as the fault resets
1/12/2020      1       1        1   1/12/2020-1/11/2020  1/12/2020-1/11/2020

Below is the code.

import pandas as pd

d = {'Date': ['1/1/2020', '1/2/2020', '1/3/2020', '1/4/2020', '1/5/2020', '1/6/2020', '1/7/2020', '1/8/2020', '1/9/2020', '1/10/2020', '1/11/2020', '1/12/2020'], 'Fault_Count' : [0, 0, 0, 2, 2, 2, 4, 4, 5, 5, 0, 1]}
df = pd.DataFrame(d)
df['Date'] = pd.to_datetime(df['Date'])
df['Fault_count_diff'] = df.Fault_Count.diff().fillna(0)
df['Cumlative_Sum'] = df.Fault_count_diff.cumsum()

I thought I could use cumulative sum and group by to get the groups and get the differences of the first value of groups. That's as far as I could get, also I noticed that using cumulative sum was not giving me ordered groups as some of the Fault_Count get reset.

    Date        Fault_Count   Fault_count_diff    Cumlative_Sum
0   2020-01-01       0               0.0                0.0
1   2020-01-02       0               0.0                0.0
2   2020-01-03       0               0.0                0.0
3   2020-01-04       2               2.0                2.0
4   2020-01-05       2               0.0                2.0
5   2020-01-06       2               0.0                2.0
6   2020-01-07       4               2.0                4.0
7   2020-01-08       4               0.0                4.0
8   2020-01-09       5               1.0                5.0
9   2020-01-10       5               0.0                5.0
10  2020-01-11       0              -5.0                0.0
11  2020-01-12       1               1.0                1.0

Desired output:

         Date  Fault_Count  Option1  Option2
0  2020-01-01            0      0.0      0.0
1  2020-01-02            0      0.0      0.0
2  2020-01-03            0      0.0      0.0
3  2020-01-04            2      3.0      3.0
4  2020-01-05            2      0.0      0.0
5  2020-01-06            2      0.0      0.0
6  2020-01-07            4      3.0      3.0
7  2020-01-08            4      0.0      0.0
8  2020-01-09            5      2.0      2.0
9  2020-01-10            5      0.0      0.0
10 2020-01-11            0      2.0     -2.0
11 2020-01-12            1      1.0      1.0

Thanks for the help.


Solution

  • Use:

    m1 = df['Fault_Count'].ne(df['Fault_Count'].shift(fill_value=0))
    m2 = df['Fault_Count'].eq(0) & df['Fault_Count'].shift(fill_value=0).ne(0)
    
    s = df['Date'].groupby(m1.cumsum()).transform('first')
    
    df['Option1'] = df['Date'].sub(s.shift()).dt.days.where(m1, 0)
    df['Option2'] = df['Option1'].where(~m2, df['Option1'].mul(-1))
    

    Details:

    Use Series.ne + Series.shift to create boolean mask m1 which represent the boundary condition when Fault_count changes, similarly use Series.eq + Series.shift and Series.ne to create a boolean mask m2 which represent the condition where Fault_count resets:

              m1           m2
    0         False        False
    1         False        False
    2         False        False
    3          True        False
    4         False        False
    5         False        False
    6          True        False
    7         False        False
    8          True        False
    9         False        False
    10         True         True # --> Fault count reset
    11         True        False
    

    Use Series.groupby on consecutive fault counts obtained using m1.cumsum and transform the Date column using groupby.first:

    print(s)
    0    2020-01-01
    1    2020-01-01
    2    2020-01-01
    3    2020-01-04
    4    2020-01-04
    5    2020-01-04
    6    2020-01-07
    7    2020-01-07
    8    2020-01-09
    9    2020-01-09
    10   2020-01-11
    11   2020-01-12
    Name: Date, dtype: datetime64[ns]
    

    Use Series.sub to subtract Date for s shifted using Series.shift and use Series.where to fill 0 based on mask m2 and assign this to Option1. Similary we obtain Option2 from Option1 based on mask m2:

    print(df)
             Date  Fault_Count  Option1  Option2
    0  2020-01-01            0      0.0      0.0
    1  2020-01-02            0      0.0      0.0
    2  2020-01-03            0      0.0      0.0
    3  2020-01-04            2      3.0      3.0
    4  2020-01-05            2      0.0      0.0
    5  2020-01-06            2      0.0      0.0
    6  2020-01-07            4      3.0      3.0
    7  2020-01-08            4      0.0      0.0
    8  2020-01-09            5      2.0      2.0
    9  2020-01-10            5      0.0      0.0
    10 2020-01-11            0      2.0     -2.0
    11 2020-01-12            1      1.0      1.0