I need to subtract dates based on the progression of fault count.
Below is the table that has the two input columns Date
and Fault_Count
. The output columns I need are Option1
and Option2
. The last two columns show the date difference calculations. Basically when the Fault_Count changes I need to count the number of days from when the Fault_Count changed to the initial start of fault count. For example the Fault_Count changed to 2 on 1/4/2020
, I need to get the number of days from when the Fault_Count
started at 0
and changed to 2 (i.e. 1/4/2020
- 1/1/2020
= 3
).
Date Fault_Count Option1 Option2 Option1calc Option2calc
1/1/2020 0 0 0
1/2/2020 0 0 0
1/3/2020 0 0 0
1/4/2020 2 3 3 1/4/2020-1/1/2020 1/4/2020-1/1/2020
1/5/2020 2 0 0
1/6/2020 2 0 0
1/7/2020 4 3 3 1/7/2020-1/4/2020 1/7/2020-1/4/2020
1/8/2020 4 0 0
1/9/2020 5 2 2 1/9/2020-1/7/2020 1/9/2020-1/7/2020
1/10/2020 5 0 0
1/11/2020 0 2 -2 1/11/2020-1/9/2020 (1/11/2020-1/9/2020)*-1 as the fault resets
1/12/2020 1 1 1 1/12/2020-1/11/2020 1/12/2020-1/11/2020
Below is the code.
import pandas as pd
d = {'Date': ['1/1/2020', '1/2/2020', '1/3/2020', '1/4/2020', '1/5/2020', '1/6/2020', '1/7/2020', '1/8/2020', '1/9/2020', '1/10/2020', '1/11/2020', '1/12/2020'], 'Fault_Count' : [0, 0, 0, 2, 2, 2, 4, 4, 5, 5, 0, 1]}
df = pd.DataFrame(d)
df['Date'] = pd.to_datetime(df['Date'])
df['Fault_count_diff'] = df.Fault_Count.diff().fillna(0)
df['Cumlative_Sum'] = df.Fault_count_diff.cumsum()
I thought I could use cumulative sum and group by to get the groups and get the differences of the first value of groups. That's as far as I could get, also I noticed that using cumulative sum was not giving me ordered groups as some of the Fault_Count
get reset.
Date Fault_Count Fault_count_diff Cumlative_Sum
0 2020-01-01 0 0.0 0.0
1 2020-01-02 0 0.0 0.0
2 2020-01-03 0 0.0 0.0
3 2020-01-04 2 2.0 2.0
4 2020-01-05 2 0.0 2.0
5 2020-01-06 2 0.0 2.0
6 2020-01-07 4 2.0 4.0
7 2020-01-08 4 0.0 4.0
8 2020-01-09 5 1.0 5.0
9 2020-01-10 5 0.0 5.0
10 2020-01-11 0 -5.0 0.0
11 2020-01-12 1 1.0 1.0
Desired output:
Date Fault_Count Option1 Option2
0 2020-01-01 0 0.0 0.0
1 2020-01-02 0 0.0 0.0
2 2020-01-03 0 0.0 0.0
3 2020-01-04 2 3.0 3.0
4 2020-01-05 2 0.0 0.0
5 2020-01-06 2 0.0 0.0
6 2020-01-07 4 3.0 3.0
7 2020-01-08 4 0.0 0.0
8 2020-01-09 5 2.0 2.0
9 2020-01-10 5 0.0 0.0
10 2020-01-11 0 2.0 -2.0
11 2020-01-12 1 1.0 1.0
Thanks for the help.
Use:
m1 = df['Fault_Count'].ne(df['Fault_Count'].shift(fill_value=0))
m2 = df['Fault_Count'].eq(0) & df['Fault_Count'].shift(fill_value=0).ne(0)
s = df['Date'].groupby(m1.cumsum()).transform('first')
df['Option1'] = df['Date'].sub(s.shift()).dt.days.where(m1, 0)
df['Option2'] = df['Option1'].where(~m2, df['Option1'].mul(-1))
Details:
Use Series.ne
+ Series.shift
to create boolean mask m1
which represent the boundary condition when Fault_count
changes, similarly use Series.eq
+ Series.shift
and Series.ne
to create a boolean mask m2
which represent the condition where Fault_count
resets:
m1 m2
0 False False
1 False False
2 False False
3 True False
4 False False
5 False False
6 True False
7 False False
8 True False
9 False False
10 True True # --> Fault count reset
11 True False
Use Series.groupby
on consecutive fault counts obtained using m1.cumsum
and transform the Date
column using groupby.first
:
print(s)
0 2020-01-01
1 2020-01-01
2 2020-01-01
3 2020-01-04
4 2020-01-04
5 2020-01-04
6 2020-01-07
7 2020-01-07
8 2020-01-09
9 2020-01-09
10 2020-01-11
11 2020-01-12
Name: Date, dtype: datetime64[ns]
Use Series.sub
to subtract Date
for s
shifted using Series.shift
and use Series.where
to fill 0
based on mask m2
and assign this to Option1
. Similary we obtain Option2
from Option1
based on mask m2
:
print(df)
Date Fault_Count Option1 Option2
0 2020-01-01 0 0.0 0.0
1 2020-01-02 0 0.0 0.0
2 2020-01-03 0 0.0 0.0
3 2020-01-04 2 3.0 3.0
4 2020-01-05 2 0.0 0.0
5 2020-01-06 2 0.0 0.0
6 2020-01-07 4 3.0 3.0
7 2020-01-08 4 0.0 0.0
8 2020-01-09 5 2.0 2.0
9 2020-01-10 5 0.0 0.0
10 2020-01-11 0 2.0 -2.0
11 2020-01-12 1 1.0 1.0