I have a data frame that looks as follows:
df = pd.DataFrame({"A":[10,0,30,40,0,60,70,80,90]}, index = pd.date_range(start='1/1/2020', end='1/09/2020'))
df
A
2020-01-01 10
2020-01-02 0
2020-01-03 30
2020-01-04 40
2020-01-05 0
2020-01-06 60
2020-01-07 70
2020-01-08 80
2020-01-09 90
I want to loop over the data frame to get t0
and t1
which represent the first and last dates, respectively (i.e. index value) for each range of non-zero values of A
.
In the table above, I want to get the following values for t0
and t1
:
t0 = 2020-01-01
, t1 = 2020-01-01
t0 = 2020-01-03
, t1 = 2020-01-04
t0 = 2020-01-06
, t1 = 2020-01-09
Is there a simple way to do this within Pandas?
One option would be to create a Boolean Index based on where A=0. Then groupby
the cumsum
of that Index and aggregate
the first
and last
dates (assumes index is sorted as it is here):
new_df = df.reset_index()
m = new_df['A'].eq(0)
new_df = (
new_df.groupby(m.cumsum()[~m])
.agg(t0=('index', 'first'), t1=('index', 'last'))
.reset_index(drop=True)
)
t0 t1
0 2020-01-01 2020-01-01
1 2020-01-03 2020-01-04
2 2020-01-06 2020-01-09
An alternative to grab min
and max
index values instead of first
and last
:
new_df = df.reset_index()
m = new_df['A'].eq(0)
new_df = (
new_df.groupby(m.cumsum()[~m])
.agg(t0=('index', 'min'), t1=('index', 'max'))
.reset_index(drop=True)
)
t0 t1
0 2020-01-01 2020-01-01
1 2020-01-03 2020-01-04
2 2020-01-06 2020-01-09
Indexing:
m
:
0 False
1 True
2 False
3 False
4 True
5 False
6 False
7 False
8 False
Name: A, dtype: bool
Create Groups Separated by where 0s are:
m.cumsum()
0 0
1 1
2 1
3 1
4 2
5 2
6 2
7 2
8 2
Name: A, dtype: int32
Self filtered to exclude 0 rows:
m.cumsum()[~m]
0 0
2 1
3 1
5 2
6 2
7 2
8 2
Name: A, dtype: int32
Then group these rows together to determine associated dates.