Tags: python, pandas, dataframe, missing-data

Add missing data to a pandas DataFrame depending on multiple conditions


I have a pd.DataFrame that looks like this:

                datetime ... month  year seconds
0    2023-03-02 20:59:00 ...     3  2023   75540
1    2023-03-02 20:58:00 ...     3  2023   75480
2    2023-03-02 20:57:00 ...     3  2023   75420
3    2023-03-02 20:56:00 ...     3  2023   75360
4    2023-03-02 20:55:00 ...     3  2023   75300
..                   ... ...   ...   ...     ...
775  2023-03-01 14:34:00 ...     3  2023   52440
776  2023-03-01 14:33:00 ...     3  2023   52380
777  2023-03-01 14:32:00 ...     3  2023   52320
778  2023-03-01 14:31:00 ...     3  2023   52260
779  2023-03-01 14:30:00 ...     3  2023   52200

[780 rows x 11 columns]

The datetime column can only contain values between 14:30:00 and 20:59:00; any rows outside this range need to be deleted. Also, the gap between two consecutive rows should be exactly 60 seconds, except when changing from one day to another (when df['seconds'] == 52200). Note that the data can span multiple days.
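
For the deletion requirement on its own, a minimal sketch like the following would work (made-up sample data; it assumes filtering with DataFrame.between_time, which keeps both endpoints by default, is acceptable):

    import pandas as pd

    #made-up sample: two rows inside the 14:30-20:59 window, two outside
    df = pd.DataFrame({'datetime': pd.to_datetime(['2023-03-02 14:29:00',
                                                   '2023-03-02 14:30:00',
                                                   '2023-03-02 20:59:00',
                                                   '2023-03-02 21:00:00']),
                       'x': [1, 2, 3, 4]})

    #keep only times between 14:30:00 and 20:59:00 (inclusive on both ends)
    df = (df.set_index('datetime')
            .between_time('14:30', '20:59')
            .reset_index())
    print(df)   #only the 14:30:00 and 20:59:00 rows remain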

The first row of the df does not have to be 20:59:00; it could start at 16:30:00, for example. The same goes for the last row, meaning it is not always 14:30:00. Some rows are randomly missing, and I want to add them back in.

The datetime, month, year and seconds columns should hold the missing times, while the values in the other columns should be interpolated from the two surrounding rows. Say, for example, we have:

                  datetime     x month  year seconds
299    2023-03-02 18:59:00   200     3  2023   68340
300    2023-03-02 18:56:00   230     3  2023   68160

I need to insert two rows (as the difference in seconds is 180, we have 2 missing minutes). The x column for these rows should be 210 and 220 respectively. I tried doing this with a while loop, but it is extremely inefficient, as I sometimes have DataFrames with over 400,000 rows. Is there a way to do this using the pandas library, or anything else that would be more efficient than iterating over each pair of rows, checking how many rows have to be inserted, then inserting them and resetting the index?
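
For reference, here is a minimal sketch (made-up data reproducing just those two rows) of the expected fill, assuming a plain upsample with asfreq and linear interpolation; it only illustrates the 210/220 values and does not handle day boundaries or the other columns:

    import pandas as pd

    #the two rows from the example above
    sample = pd.DataFrame({'datetime': pd.to_datetime(['2023-03-02 18:59:00',
                                                       '2023-03-02 18:56:00']),
                           'x': [200, 230]})

    filled = (sample.set_index('datetime')
                    .sort_index()                 #asfreq needs an increasing index
                    .asfreq('min')                #insert the 2 missing minutes as NaN
                    .interpolate()                #linear fill -> 230, 220, 210, 200
                    .sort_index(ascending=False)  #back to descending order
                    .reset_index())
    print(filled)   #x is 200, 210, 220, 230 from 18:59 down to 18:56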


Solution

  • Use:

    import pandas as pd

    #convert to datetime
    df['datetime'] = pd.to_datetime(df['datetime'])
    
    #create MultiIndex by dates and seconds columns
    out = df.set_index([df['datetime'].dt.normalize(), 'seconds'])
    
    #create full MultiIndex by all combinations of all seconds and all existing dates
    mux = pd.MultiIndex.from_product([out.index.levels[0], range(52200, 75540 + 60, 60)],
                                     names=['date','seconds'])
    
    #add missing rows
    out = out.reindex(mux).reset_index()
    
    
    #if necessary, per date drop rows before the earliest and after the latest originally present timestamp
    mask = out['datetime'].notna()
    
    out = out[mask.groupby(out['date']).cummax() & 
              mask.iloc[::-1].groupby(out['date'].iloc[::-1]).cummax().iloc[::-1]]
    
    #interpolate x column
    out['x'] = out['x'].interpolate()
    #extract month and year, and rebuild datetime from date + seconds
    out['month'] = out['date'].dt.month
    out['year'] = out['date'].dt.year
    out['datetime'] = out['date'] + pd.to_timedelta(out['seconds'], 'seconds')
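
The reason this is fast is that reindex against the full MultiIndex creates every missing (date, seconds) combination in a single vectorized step, so no rows are inserted one by one and the index is never reset inside a Python loop, which keeps it practical even for several hundred thousand rows. If there are more value columns than x, the same interpolate call can be applied to each of them, and the helper date column can be dropped at the end if it is not needed.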