Search code examples
pythonpandasgroup-bytime-seriespandas-resample

Group if difference of datetime index is less than 5 minutes of a pandas series


I want to perform a groupby.first() of a pandas timeseries where the datetime index is almost consecutive, where almost is less than 5 minutes of difference. I have seen a lot of material but never if the datetime is not consecutive like in my example:

ind=['2019-02-28 01:20:00', '2019-02-28 01:21:00','2019-02-28 01:22:00', '2019-02-28 01:23:00',
     '2019-02-28 01:24:00', '2019-02-28 01:25:00','2019-02-28 01:26:00', '2019-02-28 01:27:00',
     '2019-02-28 01:28:00', '2019-02-28 04:05:00','2019-02-28 04:06:00', '2019-02-28 04:07:00',
     '2019-02-28 04:08:00', '2019-02-28 04:09:00','2019-02-28 06:55:00', '2019-02-28 06:56:00',
     '2019-02-28 06:57:00', '2019-02-28 06:58:00','2019-02-28 09:50:00', '2019-02-28 09:51:00',
     '2019-02-28 09:52:00', '2019-02-28 09:53:00','2019-02-28 09:54:00', '2019-02-28 09:55:00',          
     '2019-02-28 09:56:00', '2019-02-28 09:57:00','2019-02-28 09:58:00', '2019-02-28 09:59:00',
     '2019-02-28 10:00:00']

val=[2.11, 2.24, 2.37, 2.42, 2.58, 2.71, 2.76, 3.06, 3.29, 2.04, 2.26,2.55, 2.89, 3.26, 2.2 , 2.54,
     2.85, 3.24, 2.2 , 2.12, 2.11, 2.07,2.1 , 2.16, 2.28, 2.35, 2.44, 2.5 , 2.57]

s = pd.Series(val,index=pd.to_datetime(ind))

My desidered output should be:

Datetime               Value
2019-02-28 01:20:00    2.11
2019-02-28 04:05:00    2.04
2019-02-28 06:55:00    2.20
2019-02-28 09:50:00    2.20

Anyone can help me?


Solution

  • Let us group the dataframe on blocks of consecutive rows where time difference is less than 5min:

    df = s.reset_index(name='Value')
    b  = df['index'].diff().dt.seconds.gt(300).cumsum()
    df = df.groupby(b, as_index=False).first()
    

    Explanations

    Reset the index of the given timeseries s then calculate the difference of datetime index compared to previous element and use dt.seconds to get the difference measured in seconds.

    >>> df['index'].diff().dt.seconds
    
    0         NaN
    1        60.0
    2        60.0
    3        60.0
    4        60.0
    5        60.0
    6        60.0
    7        60.0
    8        60.0
    9      9420.0
    ....
    25       60.0
    26       60.0
    27       60.0
    28       60.0
    Name: index, dtype: float64
    

    Now compare total seconds with 300 to create a boolean mask followed by cumsum to identify blocks of rows where the difference between consecutive datetime values is less than 5 min

    >>> df['index'].diff().dt.seconds.gt(300).cumsum()
    
    0     0
    1     0
    2     0
    3     0
    4     0
    5     0
    6     0
    7     0
    8     0
    9     1
    ...
    25    3
    26    3
    27    3
    28    3
    Name: index, dtype: int64
    

    Group the dataframe on the above blocks and aggregate using first

    >>> df
                    index  Value
    0 2019-02-28 01:20:00   2.11
    1 2019-02-28 04:05:00   2.04
    2 2019-02-28 06:55:00   2.20
    3 2019-02-28 09:50:00   2.20