In the following dataframe of snapshots of a given system, I am interested in recording any changes in var1 or var2 over time, assuming that the state of the system remains the same until something changes. This is similar to run-length encoding, which condenses sequences in which the same data value occurs in many consecutive data elements. In that sense, I am interested in capturing the runs. For example:
var1  var2  timestamp
foo   2     2017-01-01 00:07:45
foo   2     2017-01-01 00:13:42
foo   3     2017-01-01 00:19:41
bar   3     2017-01-01 00:25:41
bar   2     2017-01-01 00:37:36
bar   2     2017-01-01 00:43:37
foo   2     2017-01-01 01:01:29
foo   2     2017-01-01 01:01:34
bar   2     2017-01-01 01:19:25
bar   2     2017-01-01 01:25:22
should be condensed to:
expected_output
var1  var2  min                  max
foo   2     2017-01-01 00:07:45  2017-01-01 00:19:41
foo   3     2017-01-01 00:19:41  2017-01-01 00:25:41
bar   3     2017-01-01 00:25:41  2017-01-01 00:37:36
bar   2     2017-01-01 00:37:36  2017-01-01 01:01:29
foo   2     2017-01-01 01:01:29  2017-01-01 01:19:25
bar   2     2017-01-01 01:19:25  None
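For reference, the sample frame can be rebuilt like this (assuming the timestamps are parsed as datetimes; with real datetimes the empty final max prints as NaT rather than NaN):

import pandas as pd

# Sample snapshots from above; the answer below refers to this same frame as df.
test = df = pd.DataFrame({
    'var1': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'foo', 'foo', 'bar', 'bar'],
    'var2': [2, 2, 3, 3, 2, 2, 2, 2, 2, 2],
    'timestamp': pd.to_datetime([
        '2017-01-01 00:07:45', '2017-01-01 00:13:42', '2017-01-01 00:19:41',
        '2017-01-01 00:25:41', '2017-01-01 00:37:36', '2017-01-01 00:43:37',
        '2017-01-01 01:01:29', '2017-01-01 01:01:34', '2017-01-01 01:19:25',
        '2017-01-01 01:25:22',
    ]),
})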
I have tried the following aggregation, which effectively deduplicates var1 and var2 and provides the min and max timestamps per group:
output = test.groupby(['var1','var2'])['timestamp'].agg(['min','max']).reset_index()
output
var1  var2  min                  max
bar   2     2017-01-01 00:37:36  2017-01-01 01:25:22
bar   3     2017-01-01 00:25:41  2017-01-01 00:25:41
foo   2     2017-01-01 00:07:45  2017-01-01 01:01:34
foo   3     2017-01-01 00:19:41  2017-01-01 00:19:41
However, var1 and var2 can change and later revert to the same original values, so a plain min/max per group does not work: var1 and var2 need to be compared against the previous value in the same column over time, similar to, but not exactly, what the shift() method does.
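For instance, comparing var1 against its own shift flags the rows where the value changes, but that alone does not give me runs with start and end timestamps:

>>> test['var1'] != test['var1'].shift()
0     True
1    False
2    False
3     True
4    False
5    False
6     True
7    False
8     True
9    False
Name: var1, dtype: bool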
Is there an efficient method in pandas or numpy, similar to the rle() function in R, that would group or partition such runs and take the min timestamp of the next run as the current run's max? The real dataset is over 10 million rows. Any suggestions here would be appreciated!
For contiguous grouping you can group on (df.col != df.col.shift()).cumsum(). You want a new group whenever either column changes, so you can | the two conditions together:
>>> ((df.var1 != df.var1.shift()) | (df.var2 != df.var2.shift())).cumsum()
0    1
1    1
2    2
3    3
4    4
5    4
6    5
7    5
8    6
9    6
dtype: int64
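Equivalently, and convenient if there are more than two key columns, you can compare the whole key sub-frame against its shift in one go (a sketch that produces the same labels here):

>>> keys = ['var1', 'var2']
>>> df[keys].ne(df[keys].shift()).any(axis=1).cumsum()
0    1
1    1
2    2
3    3
4    4
5    4
6    5
7    5
8    6
9    6
dtype: int64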
Then groupby + agg:
>>> cond = ((df.var1 != df.var1.shift()) | (df.var2 != df.var2.shift())).cumsum()
>>> output = df.groupby(cond).agg(
... var1=('var1', 'first'),
... var2=('var2', 'first'),
... min=('timestamp', 'min'),
... max=('timestamp', 'max')
... )
>>> output
  var1  var2  min                  max
1  foo  2     2017-01-01 00:07:45  2017-01-01 00:13:42
2  foo  3     2017-01-01 00:19:41  2017-01-01 00:19:41
3  bar  3     2017-01-01 00:25:41  2017-01-01 00:25:41
4  bar  2     2017-01-01 00:37:36  2017-01-01 00:43:37
5  foo  2     2017-01-01 01:01:29  2017-01-01 01:01:34
6  bar  2     2017-01-01 01:19:25  2017-01-01 01:25:22
You can then set the max to the next row's min:
>>> output['max'] = output['min'].shift(-1)
>>> output
  var1  var2  min                  max
1  foo  2     2017-01-01 00:07:45  2017-01-01 00:19:41
2  foo  3     2017-01-01 00:19:41  2017-01-01 00:25:41
3  bar  3     2017-01-01 00:25:41  2017-01-01 00:37:36
4  bar  2     2017-01-01 00:37:36  2017-01-01 01:01:29
5  foo  2     2017-01-01 01:01:29  2017-01-01 01:19:25
6  bar  2     2017-01-01 01:19:25  NaN
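Putting the pieces together, here is a self-contained sketch (my own wrapper around the steps above, with hypothetical names) that condenses a frame of snapshots in one pass. It assumes runs should be detected in timestamp order, so it sorts first:

import pandas as pd

def condense_runs(df, keys=('var1', 'var2'), ts='timestamp'):
    """Collapse consecutive rows whose key columns are all unchanged
    into one row per run, recording the run's start and the next
    run's start as its end."""
    df = df.sort_values(ts)  # run detection assumes time order
    keys = list(keys)
    # Start a new run whenever any key column differs from the previous row.
    run_id = df[keys].ne(df[keys].shift()).any(axis=1).cumsum()
    out = df.groupby(run_id, sort=False).agg(
        **{k: (k, 'first') for k in keys},
        min=(ts, 'min'),
    )
    # Each run ends where the next begins; the last run has no successor,
    # which leaves NaT (or NaN for string timestamps).
    out['max'] = out['min'].shift(-1)
    return out.reset_index(drop=True)

Calling condense_runs(test) then reproduces the condensed table above. Everything stays vectorized, so this should scale to the 10-million-row case: the change flags, run labels, and final shift are single column operations, and the groupby runs once over precomputed labels.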