Consider the following DataFrame:
import pandas as pd
candy = pd.DataFrame({'Name':['Bob','Bob','Bob','Annie','Annie','Annie','Daniel','Daniel','Daniel'], 'Candy': ['Chocolate', 'Chocolate', 'Lollies','Chocolate', 'Chocolate', 'Lollies','Chocolate', 'Chocolate', 'Lollies'], 'Value':[15,15,10,25,30,12,40,40,16]})
After reading this post, I am aware that apply() works on the whole DataFrame and transform() works on one Series at a time.
So if I want to append the total $ spend on candy per person, I can simply use the following:
candy['Total Spend'] = candy.groupby(['Name'])['Value'].transform('sum')
But if I need to append the total $ spend on chocolate per person, it feels like I have no choice but to create a separate DataFrame with apply() and then merge it back, since transform() only works on a Series.
chocolate = candy.groupby(['Name']).apply(lambda x: x[x['Candy'] == 'Chocolate']['Value'].sum()).reset_index(name = 'Total_Chocolate_Spend')
candy = pd.merge(candy, chocolate, how='left', on='Name')
I don't mind writing the above code to solve this problem, but is it possible to transform() the .apply()'d results back to the DataFrame without having to create a separate DataFrame and merge it?
What is actually happening when the transform() function is used? Is a separate Series being stored in memory and then merged back by index, similar to what I have done with the apply-then-merge method?
I do not have much to add to the excellent reference you provided on apply vs. transform, but you can do what you want without creating a separate DataFrame. For example:
candy.groupby(['Name']).apply(lambda x: x.assign(Total_Chocolate_Spend = x[x['Candy'] == 'Chocolate']['Value'].sum()))
This uses assign on each group in the groupby to populate Total_Chocolate_Spend with the number you want.
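As an alternative sketch (not taken from the reference above), it may also be possible to stay with transform() by zeroing out the non-chocolate rows first and then summing per person. The names chocolate_value, per_person and via_map are just illustrative. The last two lines show a rough mental model of what transform('sum') returns: one value per group, lined up against the original rows. I am not claiming this is how pandas implements it internally, only that the result matches.

```python
import pandas as pd

candy = pd.DataFrame({
    'Name': ['Bob', 'Bob', 'Bob', 'Annie', 'Annie', 'Annie',
             'Daniel', 'Daniel', 'Daniel'],
    'Candy': ['Chocolate', 'Chocolate', 'Lollies'] * 3,
    'Value': [15, 15, 10, 25, 30, 12, 40, 40, 16],
})

# Keep Value where the row is chocolate, otherwise use 0
# (chocolate_value is a hypothetical helper name).
chocolate_value = candy['Value'].where(candy['Candy'] == 'Chocolate', 0)

# Sum within each person and broadcast the total back to every row.
candy['Total_Chocolate_Spend'] = (
    chocolate_value.groupby(candy['Name']).transform('sum')
)

# Mental model only: aggregate to one value per Name, then look that
# value up for each original row. This yields the same column.
per_person = chocolate_value.groupby(candy['Name']).sum()
via_map = candy['Name'].map(per_person)

print(candy)
```

The result keeps the original index and row order, so no reset_index() or merge step is needed.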