Search code examples
pandastransformuniquegrouping

Pandas groupby multiple keys selecting unique values and transforming


I have a data frame df=

Owner     Manager     Date      Hours  City
 John      Jerry       1/2/16     10    LA
 John      Jerry       1/2/16     10    SF
 Mary      Jerry       1/2/16     9     LA
 Zach      Joe         1/3/16     5     SD
 Wendy     Joe         1/3/16     4     SF
 Hal       Joe         1/4/16     2     SD

... 100,000 entries

I would like to group by 'Manager' and 'Date', then select unique values of 'Owner' and sum 'Hours' of that selection, finally transforming the sum to a new column 'Hours_by_Manager'.

My desired output is:

Owner     Manager     Date      Hours  City   Hours_by_Manager
   John      Jerry       1/2/16     10    LA   19
   John      Jerry       1/2/16     10    SF   19
   Mary      Jerry       1/2/16     9     LA   19
   Zach      Joe         1/3/16     5     SD   9
   Wendy     Joe         1/3/16     4     SF   9
   Hal       Joe         1/4/16     2     SD   2

I tried using pandas 'groupby' like this:

df['Hours_by_Manager']=df.groupby(['Manager','Date'])['Hours'].transform(lambda x: sum(x.unique()))

Which gives me what I want, but only because the value of hours is different between 'Owner'. What I'm looking for is something like this: df['Hours_by_Manager']=df.groupby(['Manager','Date'])['Owner'].unique()['Hours']transform(lambda x: sum(x)) Which obviously is not syntactically correct. I know I could use for loops, but I would like to keep things vectorized. Any suggestions?


Solution

  • import pandas as pd
    df = pd.DataFrame({'City': ['LA', 'SF', 'LA', 'SD', 'SF', 'SD'],
        'Date': ['1/2/16', '1/2/16', '1/2/16', '1/3/16', '1/3/16', '1/4/16'],
        'Hours': [10, 10, 9, 5, 4, 2],
        'Manager': ['Jerry', 'Jerry', 'Jerry', 'Joe', 'Joe', 'Joe'],
        'Owner': ['John', 'John', 'Mary', 'Zach', 'Wendy', 'Hal']})
    
    uniques = df.drop_duplicates(subset=['Hours','Owner','Date'])
    hours = uniques.groupby(['Manager', 'Date'])['Hours'].sum().reset_index()
    hours = hours.rename(columns={'Hours':'Hours_by_Manager'})
    result = pd.merge(df, hours, how='left')
    print(result)
    

    yields

      City    Date  Hours Manager  Owner  Hours_by_Manager
    0   LA  1/2/16     10   Jerry   John                19
    1   SF  1/2/16     10   Jerry   John                19
    2   LA  1/2/16      9   Jerry   Mary                19
    3   SD  1/3/16      5     Joe   Zach                 9
    4   SF  1/3/16      4     Joe  Wendy                 9
    5   SD  1/4/16      2     Joe    Hal                 2
    

    Explanation:

    An Owner on a given Date works a unique number of Hours. So let's first create a table of unique ['Hours','Owner','Date'] rows:

    uniques = df.drop_duplicates(subset=['Hours','Owner','Date'])
    # alternatively, uniques = df.groupby(['Hours','Owner','Date']).first().reset_index()
    #   City    Date  Hours Manager  Owner
    # 0   LA  1/2/16     10   Jerry   John
    # 2   LA  1/2/16      9   Jerry   Mary
    # 3   SD  1/3/16      5     Joe   Zach
    # 4   SF  1/3/16      4     Joe  Wendy
    # 5   SD  1/4/16      2     Joe    Hal
    

    Now we can group by ['Manager', 'Date'] and sum the Hours:

    hours = uniques.groupby(['Manager', 'Date'])['Hours'].sum().reset_index()
      Manager    Date  Hours
    0   Jerry  1/2/16     19
    1     Joe  1/3/16      9
    2     Joe  1/4/16      2
    

    The hours['Hours'] column contains the values we want in df['Hours_by_Manager'].

    hours = hours.rename(columns={'Hours':'Hours_by_Manager'})
    

    So now we can merge df and hours to obtain the desired result:

    result = pd.merge(df, hours, how='left')
    #   City    Date  Hours Manager  Owner  Hours_by_Manager
    # 0   LA  1/2/16     10   Jerry   John                19
    # 1   SF  1/2/16     10   Jerry   John                19
    # 2   LA  1/2/16      9   Jerry   Mary                19
    # 3   SD  1/3/16      5     Joe   Zach                 9
    # 4   SF  1/3/16      4     Joe  Wendy                 9
    # 5   SD  1/4/16      2     Joe    Hal                 2