
Aggregating row repeats in pandas (run lengths)


In the following dataframe of snapshots of a given system, I am interested in recording any change in var1 or var2 over time, assuming that the state of the system remains the same until something changes. This is similar to run-length encoding, which condenses sequences in which the same value occurs in many consecutive elements; in that sense, I am interested in capturing the runs. For example:

    var1 var2   timestamp
    foo  2  2017-01-01 00:07:45
    foo  2  2017-01-01 00:13:42
    foo  3  2017-01-01 00:19:41
    bar  3  2017-01-01 00:25:41
    bar  2  2017-01-01 00:37:36
    bar  2  2017-01-01 00:43:37
    foo  2  2017-01-01 01:01:29
    foo  2  2017-01-01 01:01:34
    bar  2  2017-01-01 01:19:25
    bar  2  2017-01-01 01:25:22

should be condensed to:

expected_output

    var1  var2                  min                  max
     foo     2  2017-01-01 00:07:45  2017-01-01 00:19:41
     foo     3  2017-01-01 00:19:41  2017-01-01 00:25:41
     bar     3  2017-01-01 00:25:41  2017-01-01 00:37:36
     bar     2  2017-01-01 00:37:36  2017-01-01 01:01:29
     foo     2  2017-01-01 01:01:29  2017-01-01 01:19:25
     bar     2  2017-01-01 01:19:25                 None
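
For reference, the sample above can be rebuilt as a dataframe like this (a minimal sketch; the name test matches the attempt below, and parsing timestamp with to_datetime is an assumption about the real data's dtype):

    import pandas as pd

    test = pd.DataFrame({
        'var1': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar',
                 'foo', 'foo', 'bar', 'bar'],
        'var2': [2, 2, 3, 3, 2, 2, 2, 2, 2, 2],
        'timestamp': pd.to_datetime([
            '2017-01-01 00:07:45', '2017-01-01 00:13:42',
            '2017-01-01 00:19:41', '2017-01-01 00:25:41',
            '2017-01-01 00:37:36', '2017-01-01 00:43:37',
            '2017-01-01 01:01:29', '2017-01-01 01:01:34',
            '2017-01-01 01:19:25', '2017-01-01 01:25:22',
        ]),
    })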

I have tried the following aggregation, which effectively deduplicates var1 and var2 and provides the min and max timestamps per group:

    output = test.groupby(['var1','var2'])['timestamp'].agg(['min','max']).reset_index()

output

    var1  var2                  min                  max
     bar     2  2017-01-01 00:37:36  2017-01-01 01:25:22
     bar     3  2017-01-01 00:25:41  2017-01-01 00:25:41
     foo     2  2017-01-01 00:07:45  2017-01-01 01:01:34
     foo     3  2017-01-01 00:19:41  2017-01-01 00:19:41

However, var1 and var2 can change and later revert to the same original values, so grouping on the values alone does not work: each row needs to be compared with the previous row in the same column over time, similar to (but not exactly) what the shift() method does.

Is there an efficient method in pandas or numpy, similar to the rle() function in R, that would group or partition such runs and take the min timestamp of the next run as the current run's max? The real dataset is over 10 million rows. Any suggestions here would be appreciated!


Solution

  • For contiguous grouping you can group on (df.col != df.col.shift()).cumsum().

    You want this for either column, so | the two boolean masks together:

    >>> ((df.var1 != df.var1.shift()) | (df.var2 != df.var2.shift())).cumsum()
    0    1
    1    1
    2    2
    3    3
    4    4
    5    4
    6    5
    7    5
    8    6
    9    6
    dtype: int64
    

    groupby + agg

    >>> cond = ((df.var1 != df.var1.shift()) | (df.var2 != df.var2.shift())).cumsum()
    >>> output = df.groupby(cond).agg(
    ...     var1=('var1', 'first'),
    ...     var2=('var2', 'first'),
    ...     min=('timestamp', 'min'),
    ...     max=('timestamp', 'max')
    ... )
    >>> output
      var1  var2                  min                  max
    1  foo     2  2017-01-01 00:07:45  2017-01-01 00:13:42
    2  foo     3  2017-01-01 00:19:41  2017-01-01 00:19:41
    3  bar     3  2017-01-01 00:25:41  2017-01-01 00:25:41
    4  bar     2  2017-01-01 00:37:36  2017-01-01 00:43:37
    5  foo     2  2017-01-01 01:01:29  2017-01-01 01:01:34
    6  bar     2  2017-01-01 01:19:25  2017-01-01 01:25:22
    

    You can then set each run's max to the next run's min:

    >>> output['max'] = output['min'].shift(-1)
    >>> output
      var1  var2                  min                  max
    1  foo     2  2017-01-01 00:07:45  2017-01-01 00:19:41
    2  foo     3  2017-01-01 00:19:41  2017-01-01 00:25:41
    3  bar     3  2017-01-01 00:25:41  2017-01-01 00:37:36
    4  bar     2  2017-01-01 00:37:36  2017-01-01 01:01:29
    5  foo     2  2017-01-01 01:01:29  2017-01-01 01:19:25
    6  bar     2  2017-01-01 01:19:25                  NaN
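
    Putting the pieces together, here is a sketch of the whole pipeline as one function (the named-aggregation syntax requires pandas 0.25+; the name condense_runs is just illustrative):

    import pandas as pd

    def condense_runs(df):
        # Label each run: a new run starts wherever var1 or var2
        # differs from the previous row.
        run_id = ((df.var1 != df.var1.shift()) |
                  (df.var2 != df.var2.shift())).cumsum()
        out = df.groupby(run_id).agg(
            var1=('var1', 'first'),
            var2=('var2', 'first'),
            min=('timestamp', 'min'),
            max=('timestamp', 'max'),
        )
        # Each run effectively ends when the next one begins, so
        # overwrite max with the next run's min (NaT/NaN for the last run).
        out['max'] = out['min'].shift(-1)
        return out.reset_index(drop=True)

    output = condense_runs(test)

    All of the steps involved (shift, cumsum, groupby) are vectorized, so this should remain practical on the 10-million-row dataset mentioned in the question.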