I have a table containing non-negative values against a datetimeindex, like the following:
CapturableSeparation
date
2021-02-23 18:09:00 0.00
2021-02-23 18:10:00 0.00
2021-02-23 18:11:00 0.04
2021-02-23 18:12:00 0.04
2021-02-23 18:13:00 0.00
... ...
2021-02-25 23:56:00 0.00
2021-02-25 23:57:00 0.91
2021-02-25 23:58:00 0.74
2021-02-25 23:59:00 0.55
I want to create a table of how long positive values persist between zeros (the amount of time positive values last before reverting to 0), keyed by the average value of "CapturableSeparation" over each run of consecutive positive values. For the data that is visible, the table might look like:
AvgValue
persistence
00:02:00 0.04
00:03:00 0.73
where the first row corresponds to the positive values at the beginning of the Dataframe that persist for 2 minutes and the second row corresponds to those at the end that persist for 3 minutes.
How should this be done?
Here is one way of solving the problem: identify the consecutive blocks of non-zero values using boolean masking and cumsum:
import numpy as np
import pandas as pd

# Assumes 'date' is a regular column; call df.reset_index() first if it is the index.
m = df['CapturableSeparation'].eq(0)           # True where the value is 0
b = m.cumsum()[~m]                             # block label for each non-zero row
agg_dict = {'persistence': ('date', np.ptp),
            'avgvalue': ('CapturableSeparation', 'mean')}
out = df.groupby(b, as_index=False).agg(**agg_dict)
out['persistence'] += pd.Timedelta(minutes=1)  # ptp spans first-to-last, one sample short
Details:
Compare the CapturableSeparation column with 0 to create a boolean mask:
>>> m
0 True
1 True
2 False
3 False
4 True
5 True
6 False
7 False
8 False
Name: CapturableSeparation, dtype: bool
Then use cumsum on the above boolean mask to identify the blocks of consecutive non-zero values:
>>> b
2 2
3 2
6 4
7 4
8 4
Name: CapturableSeparation, dtype: int64
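The labelling trick is easiest to see on a small toy series (a hypothetical example, not the question's data): cumsum increments at every zero, so each run of non-zero values inherits the label written by the zero immediately before it.

```python
import pandas as pd

s = pd.Series([0, 0, 4, 4, 0, 0, 7, 7, 7])
m = s.eq(0)                 # [T, T, F, F, T, T, F, F, F]
labels = m.cumsum()[~m]     # keep labels only for the non-zero rows
print(labels.tolist())      # [2, 2, 4, 4, 4]
```

The two runs of non-zeros get distinct labels (2 and 4), which is all groupby needs.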
Group the dataframe on these consecutive blocks, aggregating the date column with np.ptp (max minus min) and the CapturableSeparation column with mean:
>>> out
persistence avgvalue
0 0 days 00:02:00 0.040000
1 0 days 00:03:00 0.733333
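Putting it all together, here is a self-contained sketch built from the visible rows of the question, with date as a regular column. It uses max() - min() in place of np.ptp, which computes the same first-to-last span:

```python
import pandas as pd

# Sample frame mirroring the visible rows of the question.
df = pd.DataFrame({
    'date': pd.to_datetime([
        '2021-02-23 18:09', '2021-02-23 18:10', '2021-02-23 18:11',
        '2021-02-23 18:12', '2021-02-23 18:13', '2021-02-25 23:56',
        '2021-02-25 23:57', '2021-02-25 23:58', '2021-02-25 23:59',
    ]),
    'CapturableSeparation': [0, 0, 0.04, 0.04, 0, 0, 0.91, 0.74, 0.55],
})

m = df['CapturableSeparation'].eq(0)   # True at the zeros
b = m.cumsum()[~m]                     # one label per non-zero run

out = df.groupby(b, as_index=False).agg(
    persistence=('date', lambda s: s.max() - s.min()),
    avgvalue=('CapturableSeparation', 'mean'),
)
# The span is one sample short of the run length, so add one sampling interval.
out['persistence'] += pd.Timedelta(minutes=1)
print(out)
```

Rows whose label is NaN (the zeros, which were dropped from b) are excluded from the groupby automatically, so only the positive runs appear in the result.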