Tags: python, pandas, dataframe, datetime-series

How to handle duplicates with the asfreq() function? Is there another way to do this?


I have hourly data on electricity generation from various sources in various countries, downloaded from the ENTSO-E Transparency Platform. I found a data-consistency problem: in some data sets the time series is interrupted for several hours because the corresponding rows are simply missing at the source, for unexplained reasons. I solved this with the asfreq() method, which fills the gaps, and I mark those gaps with dashes; everything works :). This also solved the problem of the hour removed by daylight saving time (DST): a dash is inserted at 02:00:00 on the last Sunday in March.
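The gap-filling step described above can be sketched like this (minimal made-up data, not the real ENTSO-E series; note that asfreq() exposes a fill_value parameter that inserts the dash directly):

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series with the 03:00 row missing at the source
idx = pd.to_datetime(['2024-01-01 00:00', '2024-01-01 01:00',
                      '2024-01-01 02:00', '2024-01-01 04:00'])
df = pd.DataFrame({'data': [0.1, 0.2, 0.3, 0.5]}, index=idx)

# asfreq() reindexes to a complete hourly grid; the newly created
# 03:00 row is filled with the dash via fill_value
print(df.asfreq('h', fill_value='-'))
```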

The problem arises, however, when I want to keep the extra 02:00:00 hour (gained) on the last Sunday in October. The asfreq() function cannot work on a duplicated index, so I am forced to use drop_duplicates() first. That removes the extra October hour, which I cannot afford: it shortens the whole annual series by that one row, and in any broader analysis my generation sets no longer map hour-by-hour onto other data sets, e.g. energy prices or energy demand.
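For reference, a minimal sketch (with made-up data) of why asfreq() cannot be applied directly: it reindexes internally, and reindexing rejects an axis with duplicate labels.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly index with one duplicated timestamp,
# as produced by the DST fall-back hour
idx = pd.to_datetime(['2024-01-01 00:00', '2024-01-01 01:00',
                      '2024-01-01 02:00', '2024-01-01 02:00'])
df = pd.DataFrame({'data': np.arange(4.0)}, index=idx)

try:
    df.asfreq('h')  # internally reindexes, which rejects duplicates
except ValueError as e:
    print(e)
```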

Let me introduce a sample DataFrame with this issue:

import numpy as np
import pandas as pd

# Generate a datetime index with hourly frequency over a day
date_rng = pd.date_range(start='2024-01-01', end='2024-01-02', freq='h')

# Sample DataFrame
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randn(len(df['date']))  # Random data
df.set_index('date', inplace=True)

# Intentionally remove a couple of rows to mimic the DST March 2AM gap
df = df.drop(pd.to_datetime(['2024-01-01 03:00', '2024-01-01 07:00', '2024-01-01 11:00']))

# Intentionally introduce a duplicate to mimic the DST October 2AM hour
duplicate_row = df.iloc[[5]]
part_before = df.iloc[:6]  # Include the original row in the first part
part_after = df.iloc[6:]   # Start the second part right after the original row

# Concatenate the three parts: before the original, the duplicate, and after
df = pd.concat([part_before, duplicate_row, part_after]).reset_index()
df.rename(columns={'index': 'date'}, inplace=True)
df.set_index('date', inplace=True)

The output of this DataFrame is something like this:

date data
2024-01-01 00:00:00 0.958687
2024-01-01 01:00:00 -0.598715
2024-01-01 02:00:00 2.555558
2024-01-01 04:00:00 1.115459
2024-01-01 05:00:00 -1.719786
2024-01-01 06:00:00 -0.128536
2024-01-01 06:00:00 -0.128536
2024-01-01 08:00:00 -1.183776

The 'date' column is the index. What I would like to do with this dataset is insert dashes into the gaps in the time series, but keep the 06:00:00 row as a duplicate, which would correspond to 2AM in October. My desired DataFrame would look something like this:

date data
2024-01-01 00:00:00 0.958687
2024-01-01 01:00:00 -0.598715
2024-01-01 02:00:00 2.555558
2024-01-01 03:00:00 -
2024-01-01 04:00:00 1.115459
2024-01-01 05:00:00 -1.719786
2024-01-01 06:00:00 -0.128536
2024-01-01 06:00:00 -0.128536 #duplicate
2024-01-01 07:00:00 -
2024-01-01 08:00:00 -1.183776

Solution

  • A merge/join operation handles the duplicates correctly. We just need to create another series representing the spine (the complete hourly index) and drop it after the join.

    idx = pd.date_range(df.index.min(), df.index.max(), freq='h', name='_date')
    print(
        df.join(idx.to_series(), how='right').drop(columns='_date')
    )
    #                          data
    # 2024-01-01 00:00:00  0.636962
    # 2024-01-01 01:00:00  0.269787
    # 2024-01-01 02:00:00  0.040974
    # 2024-01-01 03:00:00       NaN
    # 2024-01-01 04:00:00  0.813270
    # 2024-01-01 05:00:00  0.912756
    # 2024-01-01 06:00:00  0.606636
    # 2024-01-01 06:00:00  0.606636
    # 2024-01-01 07:00:00       NaN
    # 2024-01-01 08:00:00  0.543625
    # 2024-01-01 09:00:00  0.935072
    # 2024-01-01 10:00:00  0.815854
    # 2024-01-01 11:00:00       NaN
    # 2024-01-01 12:00:00  0.857404
    # 2024-01-01 13:00:00  0.033586
    # 2024-01-01 14:00:00  0.729655
    # 2024-01-01 15:00:00  0.175656
    # 2024-01-01 16:00:00  0.863179
    # 2024-01-01 17:00:00  0.541461
    # 2024-01-01 18:00:00  0.299712
    # 2024-01-01 19:00:00  0.422687
    # 2024-01-01 20:00:00  0.028320
    # 2024-01-01 21:00:00  0.124283
    # 2024-01-01 22:00:00  0.670624
    # 2024-01-01 23:00:00  0.647190
    # 2024-01-02 00:00:00  0.615385
    

    To replace the NaN values with literal dashes, you can use .fillna('-'):

    idx = pd.date_range(df.index.min(), df.index.max(), freq='h', name='_date')
    print(
        df.join(idx.to_series(), how='right').drop(columns='_date')
        .fillna('-')
        .head(10)
    )
    #                          data
    # 2024-01-01 00:00:00  0.636962
    # 2024-01-01 01:00:00  0.269787
    # 2024-01-01 02:00:00  0.040974
    # 2024-01-01 03:00:00         -
    # 2024-01-01 04:00:00   0.81327
    # 2024-01-01 05:00:00  0.912756
    # 2024-01-01 06:00:00  0.606636
    # 2024-01-01 06:00:00  0.606636
    # 2024-01-01 07:00:00         -
    # 2024-01-01 08:00:00  0.543625
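    An equivalent formulation uses pd.merge instead of .join. The spine is built as a one-column frame and merged with how='left' from the spine's side; a duplicated timestamp in the data matches the same spine row twice, so both rows survive, while missing hours come out as NaN. (The data below is hypothetical, mirroring the question's example.)

    ```python
    import numpy as np
    import pandas as pd

    # Hypothetical series: a gap at 03:00 and a duplicated 02:00 row
    idx = pd.to_datetime(['2024-01-01 00:00', '2024-01-01 01:00',
                          '2024-01-01 02:00', '2024-01-01 02:00',
                          '2024-01-01 04:00'])
    df = pd.DataFrame({'data': [0.1, 0.2, 0.3, 0.3, 0.5]}, index=idx)

    # Complete hourly spine covering the observed range
    spine = pd.date_range(df.index.min(), df.index.max(), freq='h')

    # Left-merge from the spine: gaps become NaN, duplicates are kept
    out = (
        spine.to_frame(name='date')
             .merge(df, left_on='date', right_index=True, how='left')
             .set_index('date')
    )
    print(out)
    ```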