python pandas dataframe time-series plotly

Dataframe - how to insert new row with null value, conditionally based on elapsed time?


Background: My dataset acquires values at roughly 5-minute intervals, but sometimes there are gaps. I am charting it with Plotly and trying to resolve an issue where a straight line is drawn between the points on either side of a gap. Plotly has a connectgaps parameter which, if set to false, stops the line from being drawn across NaN values. However, my dataset has no NaN rows where the gaps are; it looks like this:

(where I have computed the time difference using df['time_diff_mins'] = (df['datetime'].shift(-1) - df['datetime']).dt.total_seconds() / 60)

     datetime                 value   time_diff_mins
0    2022-03-09 09:25:00      98      5
1    2022-03-09 09:30:00      104     21
2    2022-03-09 09:51:00      105     3
3    2022-03-09 09:54:00      110     nan
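
For reference, the sample frame and the time_diff_mins column can be reproduced with a short sketch like the one below. The values are copied from the table; nothing else is assumed.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2022-03-09 09:25:00",
        "2022-03-09 09:30:00",
        "2022-03-09 09:51:00",
        "2022-03-09 09:54:00",
    ]),
    "value": [98, 104, 105, 110],
})

# Minutes until the *next* reading; the last row has no successor, so it is NaN.
df["time_diff_mins"] = (
    df["datetime"].shift(-1) - df["datetime"]
).dt.total_seconds() / 60

print(df)
```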

If you look at rows 1 and 2, the time difference is 21 minutes. For this reason, I don't want the values 104 and 105 to be connected - I want a break in the line if there is a gap of greater than 15 mins and 15 seconds.

So, I am trying to insert a new row with null/nan values in my dataframe if the time difference between rows is greater than 15 mins and 15 seconds, so that Plotly will not connect the gaps.

Desired output:

     datetime                 value 
0    2022-03-09 09:25:00      98     
1    2022-03-09 09:30:00      104     
2    2022-03-09 09:40:30      nan
3    2022-03-09 09:51:00      105     
4    2022-03-09 09:54:00      110

I hope that makes sense. I know that inserting rows programmatically is probably not an optimal solution, so I haven't been able to find a good answer to this.


Solution

  • You can build a boolean mask for the gaps that are too long and append a row at the midpoint of each one with pandas.concat:

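    import pandas as pd

    df['datetime'] = pd.to_datetime(df['datetime'])

    # longest gap allowed before the line should break
    delta = pd.Timedelta(minutes=15, seconds=15)

    # time until the next row (NaT for the last row)
    d = df['datetime'].diff().shift(-1)

    # add a row at the midpoint of each long gap; its other columns are NaN
    out = (pd.concat([df,
                      df['datetime'].add(d/2)
                        .loc[d.gt(delta)].to_frame()
                      ])
             .sort_index(kind='mergesort')  # stable sort: original row stays first
           )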
    

    Output:

                 datetime  value  time_diff_mins
    0 2022-03-09 09:25:00   98.0             5.0
    1 2022-03-09 09:30:00  104.0            21.0
    1 2022-03-09 09:40:30    NaN             NaN
    2 2022-03-09 09:51:00  105.0             3.0
    3 2022-03-09 09:54:00  110.0             NaN
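
The output above keeps a duplicated index label (1) and carries the time_diff_mins column along. To get closer to the desired output in the question, one possible variant wraps the same idea in a helper and renumbers the rows. This is only a sketch and not part of the original answer: the function name insert_gap_rows and its parameters are hypothetical, and it assumes the frame is already sorted by its datetime column.

```python
import pandas as pd


def insert_gap_rows(df, time_col="datetime",
                    max_gap=pd.Timedelta(minutes=15, seconds=15)):
    """Return a new frame with a NaN row at the midpoint of every gap
    longer than max_gap. Hypothetical helper; assumes df is sorted by time_col."""
    # Time until the next reading (NaT for the last row).
    gap = df[time_col].shift(-1) - df[time_col]
    # Midpoint timestamps of the long gaps; all other columns become NaN on concat.
    fillers = (df[time_col] + gap / 2).loc[gap > max_gap].to_frame()
    return (
        pd.concat([df, fillers])
        .sort_values(time_col)      # put each filler between its neighbours
        .reset_index(drop=True)     # renumber rows 0..n-1
    )


# Sample data copied from the question.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2022-03-09 09:25:00",
        "2022-03-09 09:30:00",
        "2022-03-09 09:51:00",
        "2022-03-09 09:54:00",
    ]),
    "value": [98, 104, 105, 110],
})

out = insert_gap_rows(df)
print(out)
```

The value column of the result can then be plotted with connectgaps set to false, as described in the question, so that the line breaks at the inserted NaN row.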