Search code examples
pythonpandasdatetimeepoch

Adding missing time in pandas dataframe


I have a pandas.DataFrame with timestamps in a column. The values are in epoch and 0.1 seconds apart. Values like 1488771900.100000, 1488771900.200000 and so on. However, there are missing values. So I have 1488794389.500000 and then 1488794389.900000 with 3 missing values between. I want to insert rows in the dataframe with missing values between the max and min in this column. So if the min is 1488771900.000000 and max is 1488794660.000000, I want to insert rows with all values separated by 0.1 seconds and NA in all other columns.

I saw an answer in this link, but wasn't able to replicate the same.

How can this operation be performed?


Solution

  • You can fill in your missing times using pandas.DataFrame.resample. The caveat is that the dataframe needs to have a pandas.DateTimeIndex. In your case the time is likely stored as a float in seconds since epoch, and this needs to be converted prior to re-sampling. Here is a function which will perform that operation.

    Code:

    import datetime as dt
    import pandas as pd
    
    def resample(dataframe, time_column, sample_period):
        # make a copy of the dataframe
        dataframe = dataframe.copy()
    
        # convert epoch times to datetime
        dataframe.time = dataframe.time.apply(
            lambda ts: dt.datetime.fromtimestamp(ts))
    
        # make the datetimes into an index
        dataframe.set_index(time_column, inplace=True)
    
        # resample to desired period
        dataframe = dataframe.resample(sample_period).asfreq().reset_index()
    
        # convert datetimes back to epoch
        epoch = dt.datetime.fromtimestamp(0)
        dataframe.time = dataframe.time.apply(
            lambda ts: (ts - epoch).total_seconds())
        return dataframe
    

    Test Code:

    values = [
        (1488771900.10, 'a'),
        (1488771900.20, 'b'),
        (1488771900.30, 'c'),
        (1488771900.60, 'f'),
    ]
    columns = ['time', 'value']
    df = pd.DataFrame(values, columns=columns)
    print(df)
    
    new_df = resample(df, 'time', '100ms')
    print(new_df)
    

    Results:

               time value
    0  1.488772e+09     a
    1  1.488772e+09     b
    2  1.488772e+09     c
    3  1.488772e+09     f
    
               time value
    0  1.488772e+09     a
    1  1.488772e+09     b
    2  1.488772e+09     c
    3  1.488772e+09   NaN
    4  1.488772e+09   NaN
    5  1.488772e+09     f