Search code examples
pythonpandastime-serieslagshift

I want to (Lag or shift) a time series by a (day, week, month or year) without loops


I am basically trying to make new columns of my Time Serie and I want te lag of some days, weeks, months or years as wanted. I have made a function that solves this problem but is highly ineficien.

def lag_N_period ( df, y , days_ago=0 , weeks_ago=0 , months_ago=0 , years_ago=0 ):
   


    skip = days_ago + weeks_ago*7 + months_ago*31 + years_ago*366 


    
    ## FEATURE NAME ## 
    feature_name = '' 
    
    if days_ago > 0  :
        feature_name = feature_name + str(days_ago) + 'days_' 
    if weeks_ago > 0  :
        feature_name = feature_name + str(weeks_ago) + 'weeks_'
    if months_ago > 0  :
        feature_name = feature_name + str(months_ago) + 'months_'        
    if years_ago > 0  :
        feature_name = feature_name + str(years_ago) + 'years_'        
        
    feature_name = feature_name + 'ago'



    
    df[feature_name] = [np.nan for i in range(len(df[objetivo])) ] #Creates NaN column named 'feature_name'

    
    for i in df.index[skip:]:

        j = i - dateutil.relativedelta.relativedelta(days=days_ago , weeks=weeks_ago , months=months_ago , years=years_ago) 
        df[feature_name][i] = df[y][j]

    return df

The skip is just a int because if in the loop you call for a index in the dataframe and it doesn´t exist, you get an error, but anything else.

df is my dataframe with dates as index and 'y', the objective variable

            objective
date    
2018-01-01  3420
2018-01-02  100580
2018-01-03  78500
2018-01-04  72640
2018-01-05  64980
... ...
2021-01-27  76820
2021-01-28  90520
2021-01-29  81920
2021-01-30  20080
2021-01-31  0

I have try the .shift() function as .shift(1, period='M') but it's not the output y want. The only case it works is when i just want the lag of 5 or some days ago llike, .shift(5)


Solution

  • Given a dataframe with a DatetimeIndex which doesn't have any missing days like this

    df = pd.DataFrame(
        {"A": range(500)}, index=pd.date_range("2022-03-01", periods=500, freq="1D")
    )
    
                  A
    2022-03-01    0
    2022-03-02    1
    ...         ...
    2023-07-12  498
    2023-07-13  499
    

    you could do the following

    from dateutil.relativedelta import relativedelta
    
    delta = relativedelta(months=1)
    df["B"] = None  # None instead of other NaNs - can be changed
    idx = df.loc[df.index[0] + delta:].index
    df.loc[idx, "B"] = df.loc[[day - delta for day in idx], "A"].values
    

    and get

                  A     B
    2022-03-01    0  None
    2022-03-02    1  None
    ...         ...   ...
    2023-07-12  498   468
    2023-07-13  499   469
    

    The idx is there to make sure that the actual shifting doesn't fail. It's the part you're trying to address by skip. (Your skip is actually a bit imprecise because you're using 31/366 days for month/year lengths universally.)

    But be prepared to run into strange phenomena when you're using months and/or years. For example

    from datetime import date
    
    delta = relativedelta(months=1)
    date(2022, 3, 30) + delta == date(2022, 3, 31) + delta
    

    is True.