Tags: python, pandas, time-series, decomposition

Pandas time series decomposition based on leap year


I have a pandas DataFrame (called df) with a DatetimeIndex and a single column (named data) that holds daily values over a period of 5 years. The following code produces some random data:

import pandas as pd
import numpy as np


df_index = pd.date_range('01-01-2012', periods=5 * 365 + 2, freq='D')
df = pd.DataFrame({'data': np.random.rand(len(df_index))}, index=df_index)

I want to perform a simple yearly trend decomposition, where for each day I subtract the value from the same date one year earlier. Additionally, I want to account for leap years in the subtraction. Is there an elegant way to do that? My current approach is to compute the differences at lags of 365 and 366 days and assign them to new columns.

df['diff_365'] = df['data'].diff(365)
df['diff_366'] = df['data'].diff(366)

Afterwards, I apply a function to each row that selects the right value based on whether the same date from last year is 365 or 366 days ago.

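def decide(row):
    # Shift back 59 days (1 January to 28 February) to see whether a 29 February was crossed
    if (row.name - pd.Timedelta(days=59)).is_leap_year:
        return row['diff_366']
    else:
        return row['diff_365']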

df['yearly_diff'] = df[['diff_365', 'diff_366']].apply(decide, axis=1)

Explanation: the function decide takes as its argument a row of the DataFrame made up of the columns diff_365 and diff_366 (along with the DatetimeIndex). The expression row.name returns the date of the row. Assuming the time series has daily frequency (freq = 'D'), 59 days are subtracted, which is the number of days from 1st January to 28th February. If the resulting date falls in a leap year, the value from the diff_366 column is returned; otherwise, the value from the diff_365 column is returned.
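For comparison, the same selection rule can be sketched without a row-wise apply. This is only an illustration of the logic described above, assuming a daily DatetimeIndex with no gaps; the name use_366 is introduced here and is not part of the original code.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2012-01-01', periods=5 * 365 + 2, freq='D')
df = pd.DataFrame({'data': np.random.rand(len(idx))}, index=idx)

# Shift every date back 59 days and check whether the result lies in a leap year.
# use_366 (hypothetical name) is True where the same date last year is 366 days ago.
use_366 = (df.index - pd.Timedelta(days=59)).is_leap_year

# Pick the 366-day difference where a 29 February was crossed, else the 365-day one.
df['yearly_diff'] = np.where(use_366, df['data'].diff(366), df['data'].diff(365))
```

This keeps the 59-day rule unchanged and only replaces the per-row function call with a boolean mask over the whole index.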

This took 8 lines, and it feels as though the subtraction could be performed in one or two. I tried to apply a similar function directly to the data column (via apply with the default argument axis=0), but in that case I cannot take my DatetimeIndex into account. Is there a better way to perform the subtraction?


Solution

  • You may not need to worry about dealing with leap years explicitly. When you construct a DatetimeIndex, you can specify start and end parameters. As per the docs:

    Of the four parameters start, end, periods, and freq, exactly three must be specified.

    Here's an example of how you can restructure your logic:

    df_index = pd.date_range(start='01-01-2012', end='12-31-2016', freq='D')
    
    df = pd.DataFrame({'data': np.random.rand(len(df_index))}, index=df_index)
    
    df['yearly_diff'] = df['data'] - (df_index - pd.DateOffset(years=1)).map(df['data'].get)
    

    Explanation

    • We construct a DatetimeIndex object by supplying start, end and freq arguments.
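    • Shift the index back by one year by subtracting pd.DateOffset(years=1). This is a calendar shift, so 29 February maps to 28 February of the previous year.
    • Use the index's map method with df['data'].get to look up the value at each of these shifted dates; dates that fall before the start of the data yield NaN.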
    • Subtract the resulting series from the original data series.
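The steps above can also be written out one at a time. The following is a minimal sketch, not the answer's exact code: it swaps the map/get lookup for Series.reindex, and the names prev_year and prev_values are introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.date_range(start='2012-01-01', end='2016-12-31', freq='D')
df = pd.DataFrame({'data': np.random.rand(len(idx))}, index=idx)

# Step 2: shift every date back by one calendar year
# (2016-02-29 becomes 2015-02-28).
prev_year = idx - pd.DateOffset(years=1)

# Step 3: look up last year's values; dates not present in the index give NaN.
prev_values = df['data'].reindex(prev_year).to_numpy()

# Step 4: subtract positionally from the original series.
df['yearly_diff'] = df['data'] - prev_values
```

Every date in 2012 has no counterpart in the data, so the first 366 entries of yearly_diff are NaN. Both 28 and 29 February 2016 are compared against 28 February 2015.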