Search code examples
pythonpandasdataframegroup-byaggregate-functions

How to get rid of nested column names in Pandas from group by aggregation?


I have the following code that finds the total and unique sales for each employee using a group by with Employee_id and aggregation with Customer_id.

Sales.groupby('Employee_id').agg({
    'Customer_id': [
        ('total_sales', 'count'),
        ('unique_sales', 'nunique')
]})

It is important to know that I will perform aggregations with other columns as well, but so far this is all I have written. So if you have a proposed solution, I ask that you please consider that in case it makes a difference.

While this does exactly what I want in terms of computing total and unique sales for each employee and creating two columns, it creates nested column names. So the column names look like, [('Customer_id', 'total_sales'), ('Customer_id', 'unique_sales')], which I don't want. Is there any way to easily get rid of the nested part to only include ['total_sales', 'unique_sales'], or is the easiest thing to just rename the columns once I have finished everything?

Thanks!


Solution

  • You could simply rename the columns:

    import numpy as np
    import pandas as pd
    np.random.seed(2018)
    
    df = pd.DataFrame(np.random.randint(10, size=(100, 3)), columns=['A','B','C'])
    result = df.groupby('A').agg({'B': [('D','count'),('E','nunique')],
                                  'C': [('F','first'),('G','max')]})
    result.columns = result.columns.get_level_values(1)
    print(result)
    

    Alternatively, you could save the groupby object, and use grouped[col].agg(...) to produce sub-DataFrames which can then be pd.concat'ed together:

    import numpy as np
    import pandas as pd
    np.random.seed(2018)
    df = pd.DataFrame(np.random.randint(10, size=(100, 3)), columns=['A','B','C'])
    grouped = df.groupby('A')
    result = pd.concat([grouped['B'].agg([('D','count'),('E','nunique')]),
                        grouped['C'].agg([('F','first'),('G','max')])], axis=1)
    print(result)
    

    both code snippets yield the following (though with columns perhaps in a different order):

        D  E  F  G
    A             
    0  18  8  8  9
    1  12  8  6  6
    2  14  8  0  8
    3  10  9  8  9
    4   7  6  3  5
    5   8  5  6  7
    6   9  7  9  9
    7   8  6  4  7
    8   8  7  2  9
    9   6  5  7  9
    

    Overall, I think renaming the columns after-the-fact is the easiest and more readable option.