Search code examples
pythonpandasdataframepandas-groupbypython-itertools

How to summarize on different groupby combinations?


I am compiling a table of top-3 crops by county. Some counties have the same crop varieties in the same order. Other counties have the same crop varieties in a different order.

df1 = pd.DataFrame( { 
    "County" : ["Harney", "Baker", "Wheeler", "Hood River", "Wasco" , "Morrow","Union","Lake"] , 
    "Crop1" : ["grain", "melons", "melons", "apples", "pears", "raddish","pears","pears"],
    "Crop2" : ["melons","grain","grain","melons","carrots","pears","carrots","carrots"],
    "Crop3": ["apples","apples","apples","grain","raddish","carrots","raddish","raddish"],
    "Total_pop": [2000,1500,3000,1500,2000,2500,2700,2000]} )

I can do a groupby on Crop1, Crop2 and Crop3 and get the sum of total_pop:

df1_grouped=df1.groupby(['Crop1',"Crop2","Crop3"])['Total_pop'].sum().reset_index()

That gives me the total for specific combinations of the crops:

df1_grouped
apples  melons  grain   1500
grain   melons  apples  2000
melons  grain   apples  4500
pears   carrots raddish 6700
raddish pears   carrots 2500

What I would like, though, is to get the total population on different combinations of crops -- irrespective of whether the listed crop was crop1, crop2, or crop3. The desired result would be this:

apples  melons   grain    8000
pears   carrots  raddish  9200 

Thank you for any guidance.


Solution

  • Method 1:

    Combine the crop columns

    >>> df1['combined_temp'] = df1.apply(lambda x : list([x['Crop1'],
    ...                           x['Crop2'],
    ...                           x['Crop3']]),axis=1)
    >>> df1.head()
           County   Crop1    Crop2    Crop3  Total_pop              combined_temp
    0      Harney   grain   melons   apples       2000    [grain, melons, apples]
    1       Baker  melons    grain   apples       1500    [melons, grain, apples]
    2     Wheeler  melons    grain   apples       3000    [melons, grain, apples]
    3  Hood River  apples   melons    grain       1500    [apples, melons, grain]
    4       Wasco   pears  carrots  raddish       2000  [pears, carrots, raddish]
    

    make it a sorted tuple

    >>> df1['sorted'] = df1.apply(lambda x : tuple(sorted(x['combined_temp'])),axis=1)
    >>> df1.head()
           County   Crop1    Crop2            ...             Total_pop              combined_temp                     sorted
    0      Harney   grain   melons            ...                  2000    [grain, melons, apples]    (apples, grain, melons)
    1       Baker  melons    grain            ...                  1500    [melons, grain, apples]    (apples, grain, melons)
    2     Wheeler  melons    grain            ...                  3000    [melons, grain, apples]    (apples, grain, melons)
    3  Hood River  apples   melons            ...                  1500    [apples, melons, grain]    (apples, grain, melons)
    4       Wasco   pears  carrots            ...                  2000  [pears, carrots, raddish]  (carrots, pears, raddish)
    

    then proceed to your normal group by operation

    >>> df1_grouped = df1.groupby(['sorted'])['Total_pop'].sum().reset_index()
    >>> df1_grouped
                          sorted  Total_pop
    0    (apples, grain, melons)       8000
    1  (carrots, pears, raddish)       9200
    

    Method 2: A shorted version based on the answer by aws-apprentice

    df = df1.copy()
    
    grouping_cols = ['Crop1', 'Crop2', 'Crop3']
    
    df[grouping_cols] = pd.DataFrame(df.loc[:, grouping_cols] \
                                .apply(set, axis=1) \
                                .apply(sorted)            
                                .values \
                                .tolist(), columns=grouping_cols)
    
    >>> df.head()
           County    Crop1  Crop2    Crop3  Total_pop
    0      Harney   apples  grain   melons       2000
    1       Baker   apples  grain   melons       1500
    2     Wheeler   apples  grain   melons       3000
    3  Hood River   apples  grain   melons       1500
    4       Wasco  carrots  pears  raddish       2000
    

    now take group by group by

    >>> df.groupby(grouping_cols).Total_pop.sum()
    Crop1    Crop2  Crop3  
    apples   grain  melons     8000
    carrots  pears  raddish    9200
    Name: Total_pop, dtype: int64
    

    but i personally prefer this answer using numpy