Search code examples
pythonpandascsvdataframeexport-to-csv

How to use an aggregate function on two columns in Pandas without messing up the index?


I'm having this issue where using agg(['sum', 'count']) is outputting "subcolumns" rather than new, exclusive columns.

This is a problem because I need to export the results to a csv for salesforce upload, and the subcolumns are messing up the formatting.

So, first there is this dataframe:

Order ID | Company ID | Company Name | Product   | Date    | Units in order | Units delivered | Platform
1          J01          John's        Bubblegum   01-01-20     100             100            On Prem
2          J01          John's        Bubblegum   01-01-20     200             150            On Prem
3          M01          Mary's        Jawbreakers  01-01-20    100             100            Online 

After that, I create a subset DF from it

Subset = df[['Company ID', 'Company Name', 'Platform', 'Date', 'Units in order', 'Units delivered']]

Then I proceed to group and apply aggregate on the subset to get the sum of the units columns and the count (to know how many orders were made)

Subset = Subset.groupby(['Company ID', 'Company Name', 'Platform', 'Date'])['Units in order', 'Units delivered'].agg(['sum', 'count']).reset_index()

However, this messes up the indexes of the columns. Instead of outputting

Company ID | Company Name | Platform | Date | Sum of UIO | Count of UIO | Sum of UD | Count of UD

It's outputting

Company ID | Company Name | Platform | Date | Sum of UIO | Units in order | Units delivered |       
                                                            sum | count      sum | count

Effectively creating 4 "subcolumns". Of course when exporting to CSV, this creates a "format" that is not properly read by anything other than pandas.

How can this issue be solved and why is it happening? At first I thought it was because there was a naming conflict since I'd have 2 sum and 2 count columns, but I tried to rename them and it didnt fix the issue.


Solution

  • You can solve this issue by using pandas NamedAgg which is available since pandas 0.25.0 and python 3.5.0 if not mistaken. The code would be as follows:

    Subset = Subset.groupby(['Company ID', 'Company Name', 'Platform', 'Date']).agg(
       Units_in_order_sum = pd.NamedAgg(column='Units in order', aggfunc='sum'),
       Units_in_order_count = pd.NamedAgg(column='Units in order', aggfunc='count'),
       Units_delivered_sum = pd.NamedAgg(column='Units delivered',aggfunc='sum'),
       Units_delivered_count = pd.NamedAgg(column='Units delivered',aggfunc='count')).reset_index()