Tags: python, performance, pandas, dataframe, memory-efficient

Efficient shifting based on date within groups in Pandas?


I have a dataframe df:

import pandas as pd

df = pd.DataFrame({'id1':[1,1,1,1,1,4,4,4,6,6],
                   'id2':[45,45,33,33,33,1,1,1,34,34],
                   'vals':[0.1,0.2,0.6,0.1,0.15,0.34,0.12,0.5,0.4,0.45],
                   'date':pd.to_datetime(['2017-01-01','2017-01-02','2017-01-01',
                                          '2017-04-01','2017-04-02','2017-01-01',
                                          '2017-01-02','2017-01-03','2017-01-04',
                                          '2017-01-05'])})

I want to create lag terms based on time for each group of id1 and id2. For example, t_1 would be the value from the day before, and t_2 the value from two days before. If there is no value from two days before, I would like it to be NaN. This would be the output for the above dataframe:

    date        id1 id2 vals    t_1   t_2
0   2017-01-01  1   33  0.60    NaN   NaN
1   2017-04-01  1   33  0.10    NaN   NaN
2   2017-04-02  1   33  0.15    0.10  NaN
0   2017-01-01  1   45  0.10    NaN   NaN
1   2017-01-02  1   45  0.20    0.10  NaN
0   2017-01-01  4   1   0.34    NaN   NaN
1   2017-01-02  4   1   0.12    0.34  NaN
2   2017-01-03  4   1   0.50    0.12  0.34
0   2017-01-04  6   34  0.40    NaN   NaN
1   2017-01-05  6   34  0.45    0.40  NaN

I can do this with the code below, but it is extremely inefficient for a large number of groups: with, say, 10000 x 500 unique combinations of id1 and id2, several days of data for each, and 2 lag terms, it takes a long time.

num_of_lags = 2
for i in range(1, num_of_lags+1):
    final = pd.DataFrame()
    for name, group in df.groupby(['id1', 'id2']):
        # shift the group's dates forward by i days, so that merging back on
        # the original dates picks up the value from i days earlier
        temp = group.set_index('date', verify_integrity=False)
        temp = temp.shift(i, 'D').rename(columns={'vals':'t_' + str(i)}).reset_index()
        group = pd.merge(group, temp[['id1', 'id2', 'date', 't_' + str(i)]],
                         on=['id1', 'id2', 'date'], how='left')
        final = pd.concat([final, group], axis=0)
    df = final.copy()

Is there a more efficient way of doing this?


Solution

  • By assigning a single group number per (id1, id2) combination and combining unstack and shift, it is possible to avoid apply entirely, resulting in a great speedup.

    def compute_shift(df):
        # give every (id1, id2) combination a single group number
        df['group_no'] = df.groupby(['id1','id2']).ngroup()
        # pivot the groups into columns and fill in missing calendar days,
        # so that shifting by one row equals shifting by one day
        tmp = df[['date','vals','group_no']].set_index(['group_no','date'])\
                                            .unstack('group_no')\
                                            .resample('D').asfreq()
        tmp1 = tmp.shift(1).stack('group_no')['vals'].rename('t_1')
        tmp2 = tmp.shift(2).stack('group_no')['vals'].rename('t_2')

        # align the lagged values back onto the original rows
        df = df.join(tmp1, on=['date','group_no'])
        df = df.join(tmp2, on=['date','group_no'])
        return df

    compute_shift(df)
            date  id1  id2  vals  group_no   t_1   t_2
    0 2017-01-01    1   45  0.10         1   NaN   NaN
    1 2017-01-02    1   45  0.20         1  0.10   NaN
    2 2017-01-01    1   33  0.60         0   NaN   NaN
    3 2017-04-01    1   33  0.10         0   NaN   NaN
    4 2017-04-02    1   33  0.15         0  0.10   NaN
    5 2017-01-01    4    1  0.34         2   NaN   NaN
    6 2017-01-02    4    1  0.12         2  0.34   NaN
    7 2017-01-03    4    1  0.50         2  0.12  0.34
    8 2017-01-04    6   34  0.40         3   NaN   NaN
    9 2017-01-05    6   34  0.45         3  0.40   NaN
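
    The same idea generalizes to more than two lags; below is a minimal sketch that wraps the shift/stack step in a loop (the name compute_shifts and the num_of_lags parameter are illustrative, not part of the original answer):

    import pandas as pd

    def compute_shifts(df, num_of_lags=2):
        # same trick as above: one group number per (id1, id2) combination
        df = df.copy()
        df['group_no'] = df.groupby(['id1','id2']).ngroup()
        # pivot groups into columns and fill missing calendar days once
        tmp = df[['date','vals','group_no']].set_index(['group_no','date'])\
                                            .unstack('group_no')\
                                            .resample('D').asfreq()
        # with a daily index, shifting by i rows means shifting by i days
        for i in range(1, num_of_lags + 1):
            lag = tmp.shift(i).stack('group_no')['vals'].rename('t_' + str(i))
            df = df.join(lag, on=['date','group_no'])
        return df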
    

    To compare performance, I created a fake dataset of reasonable size:

    import numpy as np

    df = pd.DataFrame({'date':np.random.randint(1, 1000, 10**6),
                       'id1':np.random.randint(1, 100, 10**6),
                       'id2':np.random.randint(1, 100, 10**6),
                       'vals':np.random.random(10**6)})
    df = df.drop_duplicates(subset=['date','id1','id2'], keep='last')
    df = df.sort_values('date')
    dates = pd.date_range('20150101','20180101').to_series().reset_index(drop=True)
    df['date'] = df['date'].map(dates)
    

    If we compare performance with the solutions from Wen and Scott:

    %timeit df.groupby(['id1','id2'],sort=False).apply(lambda x : x['vals'].shift()*((x['date'] -  pd.to_timedelta(1, unit='d')).isin(x['date'].tolist())).replace(False,np.nan))
    824 ms ± 19.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    %timeit df.groupby(['id1','id2'], as_index=False)\
       .apply(lambda x: x.assign(t_1=x.vals.resample('D').asfreq().shift(1),\
                                 t_2=x.vals.resample('D').asfreq().shift(2)))
    1.38 s ± 25.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    %timeit compute_shift(df)
    96.4 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    If your DataFrame is not that big, I would probably prefer Scott Boston's solution because it feels cleaner, but if runtime is a concern, unstack + shift + join is faster.
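
    Written out in full, the groupby + resample approach referred to above looks roughly like the sketch below; it assumes date has been set as the index so that resample works inside each group, and it is an illustration of the idea rather than a verbatim copy of that answer:

    out = (df.set_index('date')
             .groupby(['id1', 'id2'], group_keys=False)
             .apply(lambda g: g.assign(t_1=g['vals'].resample('D').asfreq().shift(1),
                                       t_2=g['vals'].resample('D').asfreq().shift(2))))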

    EDIT: Added resample to fill missing dates.