I have a running time issue with shifting a large dataframe with datetime index.
Example using created dummy data:
df = pd.DataFrame({'col1':[0,1,2,3,4,5,6,7,8,9,10,11,12,13]*10**5,'col3':list(np.random.randint(0,100000,14*10**5)),'col2':list(pd.date_range('2020-01-01','2020-08-01',freq='M'))*2*10**5})
df.col3=df.col3.astype(str)
df.drop_duplicates(subset=['col3','col2'],keep='first',inplace=True)
If I shift not using datetimeindex, it only takes about 12s:
%%time
tmp=df.groupby('col3')['col1'].shift(2,fill_value=0)
Wall time: 12.5 s
But when I use datetimeindex, as that situation that I need, it takes about 40 minutes:
%%time
tmp=df.set_index('col2').groupby('col3')['col1'].shift(2,freq='M',fill_value=0)
Wall time: 40min 25s
In my situation, I need the data from shift(1) until shift(6) and merge them with original data by col2
and col3
. So I use for
looping and merge.
Is there any solution for this? Thanks for your answer, will appreciate so much any respond.
Ben's answer solves it:
%%time
tmp=df1[['col1','col3', 'col2']].assign(col2 = lambda x: x['col2'] + MonthEnd(2)).set_index(['col3', 'col2']).add_suffix(f'_{2}').fillna(0).reindex(pd.MultiIndex.from_frame(df1[['col3','col2']])).reset_index()
Wall time: 5.94 s
also implement to the looping:
%%time
res=(pd.concat([df1.assign(col2 = lambda x: x['col2'] + MonthEnd(i)).set_index(['col3', 'col2']).add_suffix(f'_{i}') for i in range(0,7)],axis=1).fillna(0)).reindex(pd.MultiIndex.from_frame(df1[['col3','col2']])).reset_index()
Wall time: 1min 44s
Actually, my real data is already using MonthEnd(0)
so I just use loop in range(1,7)
. I also implement to multiple columns so I don't use astype
and implement reindex
because I use left merge
.
The two operations are slightly different, and the results are not the same because your data (at least the dummy here) is not ordered and especially if you have missing dates for some col3 values. That said, the time difference seems enormous. So I think you should go a bit differently.
One way is to add X MonthEnd
to col2 for X from 0 to 6, use concat
all of them, after set_index
the col3 and col2, add_suffix
to keep track of the "shift" value. fillna
and convert the dtype
to original one. The rest is mostly cosmetic depending on your needs.
from pandas.tseries.offsets import MonthEnd
res = (
pd.concat([
df.assign(col2 = lambda x: x['col2'] + MonthEnd(i))
.set_index(['col3', 'col2'])
.add_suffix(f'_{i}')
for i in range(0,7)],
axis=1)
.fillna(0)
# depends on your original data
.astype(df['col1'].dtype)
# if you want a left merge ordered like original df
#.reindex(pd.MultiIndex.from_frame(df[['col3','col2']]))
# if you want col2 and col3 back as columns
# .reset_index()
)
Note that concat
does a outer join by default, so you end up with month that where not in your original data and col1_0 is actually the original data with my random numbers.
print(res.head(10))
col1_0 col1_1 col1_2 col1_3 col1_4 col1_5 col1_6
col3 col2
0 2020-01-31 7 0 0 0 0 0 0
2020-02-29 8 7 0 0 0 0 0
2020-03-31 2 8 7 0 0 0 0
2020-04-30 3 2 8 7 0 0 0
2020-05-31 4 3 2 8 7 0 0
2020-06-30 12 4 3 2 8 7 0
2020-07-31 13 12 4 3 2 8 7
2020-08-31 0 13 12 4 3 2 8
2020-09-30 0 0 13 12 4 3 2
2020-10-31 0 0 0 13 12 4 3