python-3.x pandas dataframe shift datetimeindex

Pandas shift datetimeindex takes too long time running

I have a running time issue with shifting a large dataframe with datetime index.

Example using created dummy data:

df = pd.DataFrame({'col1':[0,1,2,3,4,5,6,7,8,9,10,11,12,13]*10**5,'col3':list(np.random.randint(0,100000,14*10**5)),'col2':list(pd.date_range('2020-01-01','2020-08-01',freq='M'))*2*10**5})
df.col3=df.col3.astype(str)
df.drop_duplicates(subset=['col3','col2'],keep='first',inplace=True)

If I shift not using datetimeindex, it only takes about 12s:

%%time
tmp=df.groupby('col3')['col1'].shift(2,fill_value=0)
Wall time: 12.5 s

But when I use datetimeindex, as that situation that I need, it takes about 40 minutes:

%%time    
tmp=df.set_index('col2').groupby('col3')['col1'].shift(2,freq='M',fill_value=0)
Wall time: 40min 25s

In my situation, I need the data from shift(1) until shift(6) and merge them with original data by col2 and col3. So I use for looping and merge. Is there any solution for this? Thanks for your answer, will appreciate so much any respond.

Ben's answer solves it:

%%time
tmp=df1[['col1','col3', 'col2']].assign(col2 = lambda x: x['col2'] + MonthEnd(2)).set_index(['col3', 'col2']).add_suffix(f'_{2}').fillna(0).reindex(pd.MultiIndex.from_frame(df1[['col3','col2']])).reset_index()
Wall time: 5.94 s

also implement to the looping:

%%time
res=(pd.concat([df1.assign(col2 = lambda x: x['col2'] + MonthEnd(i)).set_index(['col3', 'col2']).add_suffix(f'_{i}') for i in range(0,7)],axis=1).fillna(0)).reindex(pd.MultiIndex.from_frame(df1[['col3','col2']])).reset_index() 
Wall time: 1min 44s

Actually, my real data is already using MonthEnd(0) so I just use loop in range(1,7). I also implement to multiple columns so I don't use astype and implement reindex because I use left merge.

Solution

The two operations are slightly different, and the results are not the same because your data (at least the dummy here) is not ordered and especially if you have missing dates for some col3 values. That said, the time difference seems enormous. So I think you should go a bit differently.

One way is to add X MonthEnd to col2 for X from 0 to 6, use concat all of them, after set_index the col3 and col2, add_suffix to keep track of the "shift" value. fillna and convert the dtype to original one. The rest is mostly cosmetic depending on your needs.

from pandas.tseries.offsets import MonthEnd

res = (
    pd.concat([
        df.assign(col2 = lambda x: x['col2']  + MonthEnd(i))
          .set_index(['col3', 'col2'])
          .add_suffix(f'_{i}')
        for i in range(0,7)], 
        axis=1)
      .fillna(0) 
      # depends on your original data
      .astype(df['col1'].dtype) 
      # if you want a left merge ordered like original df
      #.reindex(pd.MultiIndex.from_frame(df[['col3','col2']]))
      # if you want col2 and col3 back as columns
      # .reset_index() 
)

Note that concat does a outer join by default, so you end up with month that where not in your original data and col1_0 is actually the original data with my random numbers.

print(res.head(10))
                 col1_0  col1_1  col1_2  col1_3  col1_4  col1_5  col1_6
col3 col2                                                              
0    2020-01-31       7       0       0       0       0       0       0
     2020-02-29       8       7       0       0       0       0       0
     2020-03-31       2       8       7       0       0       0       0
     2020-04-30       3       2       8       7       0       0       0
     2020-05-31       4       3       2       8       7       0       0
     2020-06-30      12       4       3       2       8       7       0
     2020-07-31      13      12       4       3       2       8       7
     2020-08-31       0      13      12       4       3       2       8
     2020-09-30       0       0      13      12       4       3       2
     2020-10-31       0       0       0      13      12       4       3