First off, I am fairly new to Python & Pandas, so please be patient and reply in as simple terms as possible. Also, if you could elaborate on any code that is different than what I have in my sample or point me to a solid reference that would make it easy to understand, I would greatly appreciate it.
I have a dataframe (df1) of monthly data with 60+ columns & 800k rows (& growing) for 6000+ locations. I am trying to calculate the rolling mean (3 mo, 12 mo, YTD, etc) based on the location license number ('lic_num', int), month ('mo_yr', date). I have been successful in doing this using apply(). The problem is that apply() feels very slow taking 10 min. This isn't a major issue for this project because this wont be something that will need to be run on demand, but I want to become more efficient at writing code similar to this in the case where I need a project to execute faster. Here is a sample of the dataframe (df1) and my code that I use to achieve my results
lic_num mo_yr ap aw fi
120700142 2013-03-01 228214.3 206273.53 61393.0
120700142 2013-04-01 256239.4 235296.96 64228.0
120700142 2013-05-01 247725.3 227165.09 74978.0
120700142 2013-06-01 229776.8 211765.55 64559.0
120700142 2013-07-01 229036.2 210963.06 58132.0
df1_col_list = df1.columns.tolist()
for col in df1_col_list[2:5]:
df1[col+'_3mo'] = df1.groupby('lic_num', as_index=False).apply(
lambda x: x.rolling(3, on='mo_yr', min_periods=1)[col].mean()).reset_index(level=0, drop=True)
lic_num mo_yr ap aw fi ap_3mo aw_3mo fi_3mo
120700142 2013-03-01 228214.3 206273.53 61393.0 228214.300000 206273.530000 61393.000000
120700142 2013-04-01 256239.4 235296.96 64228.0 242226.850000 220785.245000 62810.500000
120700142 2013-05-01 247725.3 227165.09 74978.0 244059.666667 222911.860000 66866.333333
120700142 2013-06-01 229776.8 211765.55 64559.0 244580.500000 224742.533333 67921.666667
120700142 2013-07-01 229036.2 210963.06 58132.0 235512.766667 216631.233333 65889.666667
If apply
is slow , we try not use it. Here is more info about the reason why apply
is slow When should I ever want to use pandas apply() in my code?
s=df.groupby('lic_num', as_index=False).\
rolling(3, on='mo_yr', min_periods=1).\
mean().iloc[:,2:5].\
add_suffix('_3mo').reset_index(drop=True,level=0)
df=pd.concat([df,s],axis=1)