I'm having this issue where using agg(['sum', 'count']) is outputting "subcolumns" rather than new, exclusive columns.
This is a problem because I need to export the results to a csv for salesforce upload, and the subcolumns are messing up the formatting.
So, first there is this dataframe:
Order ID | Company ID | Company Name | Product | Date | Units in order | Units delivered | Platform
1 J01 John's Bubblegum 01-01-20 100 100 On Prem
2 J01 John's Bubblegum 01-01-20 200 150 On Prem
3 M01 Mary's Jawbreakers 01-01-20 100 100 Online
After that, I create a subset DF from it
Subset = df[['Company ID', 'Company Name', 'Platform', 'Date', 'Units in order', 'Units delivered']]
Then I proceed to group and apply aggregate on the subset to get the sum of the units columns and the count (to know how many orders were made)
Subset = Subset.groupby(['Company ID', 'Company Name', 'Platform', 'Date'])['Units in order', 'Units delivered'].agg(['sum', 'count']).reset_index()
However, this messes up the indexes of the columns. Instead of outputting
Company ID | Company Name | Platform | Date | Sum of UIO | Count of UIO | Sum of UD | Count of UD
It's outputting
Company ID | Company Name | Platform | Date | Sum of UIO | Units in order | Units delivered |
sum | count sum | count
Effectively creating 4 "subcolumns". Of course when exporting to CSV, this creates a "format" that is not properly read by anything other than pandas.
How can this issue be solved and why is it happening? At first I thought it was because there was a naming conflict since I'd have 2 sum and 2 count columns, but I tried to rename them and it didnt fix the issue.
You can solve this issue by using pandas NamedAgg
which is available since pandas 0.25.0
and python 3.5.0
if not mistaken. The code would be as follows:
Subset = Subset.groupby(['Company ID', 'Company Name', 'Platform', 'Date']).agg(
Units_in_order_sum = pd.NamedAgg(column='Units in order', aggfunc='sum'),
Units_in_order_count = pd.NamedAgg(column='Units in order', aggfunc='count'),
Units_delivered_sum = pd.NamedAgg(column='Units delivered',aggfunc='sum'),
Units_delivered_count = pd.NamedAgg(column='Units delivered',aggfunc='count')).reset_index()