Search code examples
pythonpandastime-seriesmissing-data

How to detect missing values in a time series for long periods


I have electrical consumption data between 2016-2019. The data was recorded every 30 minutes for 4 years. There is no data between 13/03/2019 - 31/03/209.

enter image description here

  • I want to ask how I can detect this miss by coding without visualization because I have 12 countries and they may have such missing values in other months and they are not visible. (Detect if there is a miss for more than 3 consecutive days). Thank you for your help!

This is the data:

        Country Code   Electric Consumption (MW)
Date (index)        
2016-01-01    84              354642.0
2016-01-02    84              376207.0
2016-01-03    84              381534.0
2016-01-04    84              435561.0
2016-01-05    84              447820.0

... ... ...
2019-12-27    12              374340.0
2019-12-28    12              372761.0
2019-12-29    12              379411.0
2019-12-30    12              416044.0
2019-12-31    12              87519.0

Solution

  • Here is a way to identify the 3 day gaps and fill them. Note that this is working on each unique country code. You can save the final_dfs with a list and use pd.concat() if you need to get them back together again.

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({'Country Code': {'2016-01-01': 84,
      '2016-01-02': 84,
      '2016-01-03': 84,
      '2016-01-04': 84,
      '2016-01-05': 84,
      '2019-12-27': 12,
      '2019-12-28': 12,
      '2019-12-29': 12,
      '2019-12-30': 12,
      '2019-12-31': 12},
     'Electric Consumption (MW)': {'2016-01-01': 354642.0,
      '2016-01-02': 376207.0,
      '2016-01-03': 381534.0,
      '2016-01-04': 435561.0,
      '2016-01-05': 447820.0,
      '2019-12-27': 374340.0,
      '2019-12-28': 372761.0,
      '2019-12-29': 379411.0,
      '2019-12-30': 416044.0,
      '2019-12-31': 87519.0}})
    
    
    # change index value for fake gap
    df.index = ['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
                   '2016-01-05', '2019-12-27', '2019-12-28',
                   '2019-12-29', '2019-12-30', '2020-01-06']
    
    #convert object dates to datetime
    df.index = pd.to_datetime(df.index)
    
    
    for g in df['Country Code'].unique():
        
        # look at each unique country 
        country_slice = df.loc[df['Country Code'] == g]
        # use timedelta to identify 3 day gaps
        country_slice['Three Day Gap'] = country_slice.index.to_series().diff() > pd.Timedelta('3d')
        # create a new index with previous min and max
        idx = pd.date_range(country_slice.index.min(), country_slice.index.max())
    
        s = country_slice['Electric Consumption (MW)']
    
        s.index = pd.DatetimeIndex(s.index)
        # this gives us a series with new rows and nans for the missing dates
        s = s.reindex(idx, fill_value=np.nan)
        # join the old data back to nex index
        country_slice_join = country_slice.join(s, how='outer', lsuffix='L')
        # now we can interpolate as missing dates are new rows
        country_slice_join['interpolate'] = country_slice_join['Electric Consumption (MW)'].interpolate(method='linear', axis=0)
        
        country_slice_join['Country Code'] = country_slice_join['Country Code'].ffill()
        # remove temp columns
        final_df = country_slice_join[['Country Code', 'interpolate']]
        
        final_df.columns = ['Country Code', 'Electric Consumption (MW)']
    

    Example country_slice_join output before final_df:

               Country Code Electric Consumption (MW)L  Three Day Gap   Electric Consumption (MW)   interpolate
    2019-12-27  12.0    374340.0    False   374340.0    374340.000000
    2019-12-28  12.0    372761.0    False   372761.0    372761.000000
    2019-12-29  12.0    379411.0    False   379411.0    379411.000000
    2019-12-30  12.0    416044.0    False   416044.0    416044.000000
    2019-12-31  12.0    NaN         NaN     NaN         369111.857143
    2020-01-01  12.0    NaN         NaN     NaN         322179.714286
    2020-01-02  12.0    NaN         NaN     NaN         275247.571429
    2020-01-03  12.0    NaN         NaN     NaN         228315.428571
    2020-01-04  12.0    NaN         NaN     NaN         181383.285714
    2020-01-05  12.0    NaN         NaN     NaN         134451.142857
    2020-01-06  12.0    87519.0     True    87519.0     87519.000000
    

    Example final_df output without temporary columns:

             Country Code   Electric Consumption (MW)
    2019-12-27  12.0    374340.000000
    2019-12-28  12.0    372761.000000
    2019-12-29  12.0    379411.000000
    2019-12-30  12.0    416044.000000
    2019-12-31  12.0    369111.857143
    2020-01-01  12.0    322179.714286
    2020-01-02  12.0    275247.571429
    2020-01-03  12.0    228315.428571
    2020-01-04  12.0    181383.285714
    2020-01-05  12.0    134451.142857
    2020-01-06  12.0    87519.000000