Search code examples
pythonpandasdata-analysis

Grouping by multiple criteria in pandas


I have a pandas data structure such as this:

>>> df
        Benny  Daniel   Doris   Eric   Jack    Zoe
Age        75      30      95     25     28     23
Salary   2000    9000  100000  10000  12000  20000 

I would like to find the mean age and salary for several different groups, where each is a subset of the columns, and they may overlap, such as this dictionary for example:

{'Parrot lovers': ['Doris', 'Benny'], 'Tea Drinkers': ['Doris', 'Zoe'],\
 'Maintainance': ['Benny', 'Jack'], 'Coffee Drinkers': ['Benny', 'Eric'],\
 'Senior Management': ['Doris', 'Zoe', 'Jack']}

How can I design a groupby function that will do this?


Solution

  • Here is how I set up the problem...

    import StringIO
    import pandas as pd
    
    df = """index  Benny  Daniel   Doris   Eric   Jack    Zoe
    Age        75      30      95     25     28     23
    Salary   2000    9000  100000  10000  12000  20000"""
    df = pd.read_csv(StringIO.StringIO(df),sep="\s+").set_index('index')
    d = {'Parrot lovers': ['Doris', 'Benny'], 'Tea Drinkers': ['Doris', 'Zoe'],\
     'Maintainance': ['Benny', 'Jack'], 'Coffee Drinkers': ['Benny', 'Eric'],\
     'Senior Management': ['Doris', 'Zoe', 'Jack']}
    

    For the solution Just Use .loc and iterate through the dictionary...

    averages = {k:df.loc[:,v].mean(axis=1) for k,v in d.iteritems()}
    print pd.DataFrame(averages).T #gives the nice printout...
    
    index                    Age  Salary
    Coffee Drinkers    50.000000    6000
    Maintainance       51.500000    7000
    Parrot lovers      85.000000   51000
    Senior Management  48.666667   44000
    Tea Drinkers       59.000000   60000