Tags: python, pandas, time-series, subsequence

Find index of longest consecutive timespan in pandas time series


I have a time series with gaps (NaN) and I want to find the start and stop index of the longest consecutive sequence in which no NaN occurs. I have no clue how to do that.

import numpy as np
from pandas import Timestamp

values = [5468.0,
 5946.0,
 np.nan,
 6019.0,
 5797.0,
 5879.0,
 np.nan,
 5706.0,
 5986.0,
 6228.0,
 6285.0,
 np.nan,
 5667.0,
 5886.0,
 6380.0,
 5988.0,
 6290.0,
 5899.0,
 6429.0,
 6177.0]

dates = [Timestamp('2018-10-17 13:30:00'),
 Timestamp('2018-10-17 14:00:00'),
 Timestamp('2018-10-17 14:30:00'),
 Timestamp('2018-10-17 15:00:00'),
 Timestamp('2018-10-17 15:30:00'),
 Timestamp('2018-10-17 16:00:00'),
 Timestamp('2018-10-17 16:30:00'),
 Timestamp('2018-10-17 17:00:00'),
 Timestamp('2018-10-17 17:30:00'),
 Timestamp('2018-10-17 18:00:00'),
 Timestamp('2018-10-17 18:30:00'),
 Timestamp('2018-10-17 19:00:00'),
 Timestamp('2018-10-17 19:30:00'),
 Timestamp('2018-10-17 20:00:00'),
 Timestamp('2018-10-17 20:30:00'),
 Timestamp('2018-10-17 21:00:00'),
 Timestamp('2018-10-17 21:30:00'),
 Timestamp('2018-10-17 22:00:00'),
 Timestamp('2018-10-17 22:30:00'),
 Timestamp('2018-10-17 23:00:00')]
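
For reference, the answer below works on a two-column DataFrame assembled from these two lists (the name df is just the convention used there):

import pandas as pd

df = pd.DataFrame({'date': dates, 'value': values})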

I found a lot of solutions here on Stack Overflow, but they all work with daily data and count with ±1, which doesn't work for my 30-minute frequency.

I know that I can get True/False with isnull() and then groupby(), or use dates.diff()[1:], but I know too little to put a solution together.
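
For illustration, that isnull()-plus-cumsum() idea does most of the work already; a minimal sketch of the group labels it produces on the data above (using the df built earlier):

labels = df['value'].isnull().cumsum()
print(labels.tolist())
# [0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3]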


Solution

  • You are almost there. Create groups, then within each group find the difference between the second timestamp (the first one belongs to the NaN row) and the last one:

    >>> (df.groupby(df['value'].isna().cumsum())['date']
           .apply(lambda x: x.iloc[-1] - x.iloc[1])  # iloc[1] skips the group's leading NaN row
           .max())
    
    Timedelta('0 days 03:30:00')
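
    Dropping the final .max() makes the groups visible, and also exposes one quirk: group 0 has no leading NaN, so iloc[1] measures it as zero-length:

    >>> (df.groupby(df['value'].isna().cumsum())['date']
           .apply(lambda x: x.iloc[-1] - x.iloc[1]))

    value
    0   0 days 00:00:00
    1   0 days 01:00:00
    2   0 days 01:30:00
    3   0 days 03:30:00
    Name: date, dtype: timedelta64[ns]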
    

    Input dataframe:

    >>> df
                      date   value
    0  2018-10-17 13:30:00  5468.0  # Group 0
    1  2018-10-17 14:00:00  5946.0  # Group 0
    2  2018-10-17 14:30:00     NaN  # Group 1
    3  2018-10-17 15:00:00  6019.0  # Group 1
    4  2018-10-17 15:30:00  5797.0  # Group 1
    5  2018-10-17 16:00:00  5879.0  # Group 1
    6  2018-10-17 16:30:00     NaN  # Group 2
    7  2018-10-17 17:00:00  5706.0  # Group 2
    8  2018-10-17 17:30:00  5986.0  # Group 2
    9  2018-10-17 18:00:00  6228.0  # Group 2
    10 2018-10-17 18:30:00  6285.0  # Group 2
    11 2018-10-17 19:00:00     NaN  # Group 3
    12 2018-10-17 19:30:00  5667.0  # Group 3
    13 2018-10-17 20:00:00  5886.0  # Group 3
    14 2018-10-17 20:30:00  6380.0  # Group 3
    15 2018-10-17 21:00:00  5988.0  # Group 3
    16 2018-10-17 21:30:00  6290.0  # Group 3
    17 2018-10-17 22:00:00  5899.0  # Group 3
    18 2018-10-17 22:30:00  6429.0  # Group 3
    19 2018-10-17 23:00:00  6177.0  # Group 3
    

    Update: to extract the longest series:

    groups = df['value'].isna().cumsum()  # run label increases at every NaN
    best_group = df.groupby(groups)['date'].apply(lambda x: x.iloc[-1] - x.iloc[1]).idxmax()
    out = df[groups == best_group]  # rows of the longest run (including its leading NaN row)
    print(out)
    
    # Output
                      date   value
    11 2018-10-17 19:00:00     NaN
    12 2018-10-17 19:30:00  5667.0
    13 2018-10-17 20:00:00  5886.0
    14 2018-10-17 20:30:00  6380.0
    15 2018-10-17 21:00:00  5988.0
    16 2018-10-17 21:30:00  6290.0
    17 2018-10-17 22:00:00  5899.0
    18 2018-10-17 22:30:00  6429.0
    19 2018-10-17 23:00:00  6177.0
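
    Two caveats with the slice above: it still carries the leading NaN row, and group 0 (which has no leading NaN) is measured as zero-length by iloc[1]. A sketch of a variant that handles both and also yields the start/stop index the question asked for (valid, spans, and best are illustrative names, not part of the original answer):

    groups = df['value'].isna().cumsum()   # run label increases at every NaN
    valid = df['value'].notna()            # keep only real observations

    # Measure each run over its valid rows only, so group 0 is handled too
    spans = (df[valid].groupby(groups[valid])['date']
               .agg(lambda x: x.iloc[-1] - x.iloc[0]))
    best = spans.idxmax()

    out = df[valid & (groups == best)]     # longest run, without the NaN row
    print(out.index[0], out.index[-1])     # start/stop index: 12 19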