Let's say we have a sample dataframe that looks like this:
import pandas as pd

# Create a sample dataframe
df = pd.DataFrame({'num_posts': [4, 6, 3, 9, 1, 14, 2, 5, 7, 2, 12],
                   'date': ['2020-03-01', '2020-01-02', '2020-01-03',
                            '2020-01-04', '2019-01-05', '2019-01-06',
                            '2020-01-07', '2020-01-08', '2020-01-09',
                            '2020-01-10', '2020-01-11']})
cols = ['num_posts']
Now I need to create features such as sum(num_posts) for the last week of the previous year, the ratio of the last week of the previous year to the last month of the previous year, etc.
What I am doing right now is:
def create_calendar_columns(df, dt):
    df['day_number'] = pd.to_datetime(df[dt]).dt.dayofyear
    df['week_number'] = pd.to_datetime(df[dt]).dt.isocalendar().week
    df['year'] = pd.to_datetime(df[dt]).dt.year
    return df
and then
def calculate_curr_year_sum(df, cols, dt):
    df = create_calendar_columns(df, dt)
    df[dt] = pd.to_datetime(df[dt])
    for col in cols:
        df['curr_year_sum_' + str(col)] = df.loc[df['year'] == df[dt].dt.year, col].sum()
        # sum of 'col' for all rows having the same year as the current row
But it does not select the subset of the dataframe with the same year value. Instead, it takes the sum of the whole dataframe and inserts 65, the sum of "num_posts" over all rows, into every row.
How can I select the subset of the dataframe whose rows have the same year as the current row?
That's because df['year'] == df[dt].dt.year is True everywhere: df['year'] was itself computed from df[dt].dt.year, so you are comparing a column with itself, row by row. You never iterate over the different years, so you won't see different results for different years. What I think you want to do is a groupby transformation:
>>> df.groupby("year")["num_posts"].transform("sum")
0     50
1     50
2     50
3     50
4     15
5     15
6     50
7     50
8     50
9     50
10    50
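Here is a minimal sketch of how that transformation could slot into your function, assuming the df, cols and 'date' column from the question. The helper name add_year_sums is my own invention, not a pandas API, and it works on a copy rather than modifying df in place:

```python
import pandas as pd

df = pd.DataFrame({
    'num_posts': [4, 6, 3, 9, 1, 14, 2, 5, 7, 2, 12],
    'date': ['2020-03-01', '2020-01-02', '2020-01-03',
             '2020-01-04', '2019-01-05', '2019-01-06',
             '2020-01-07', '2020-01-08', '2020-01-09',
             '2020-01-10', '2020-01-11'],
})


def add_year_sums(df, cols, dt):
    """Hypothetical helper: add one per-year sum column for each column in cols."""
    out = df.copy()                     # leave the caller's dataframe untouched
    out[dt] = pd.to_datetime(out[dt])   # parse the date strings once
    out['year'] = out[dt].dt.year
    for col in cols:
        # transform('sum') computes each year's total and broadcasts it
        # back onto every row belonging to that year
        out['curr_year_sum_' + str(col)] = out.groupby('year')[col].transform('sum')
    return out


# The mask from the question compares the year values with themselves,
# so it selects every row
years = pd.to_datetime(df['date']).dt.year
mask_all_true = bool((years == years).all())

result = add_year_sums(df, ['num_posts'], 'date')
year_sums = result['curr_year_sum_num_posts'].tolist()
```

The same pattern should extend to your other features: build a key column for the period you care about (for example year plus week number), group by it, and transform.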