Search code examples
pythonpandasgroup-bymulti-index

Pandas Groupby: selection where both subgroups exist


My DataFrame has the following form.

id group color
i1 aa    white
i1 aa    white
i1 ab    white
i1 ab    black
...

I apply a groupby as following:

groupdf = df.groupby(['id', 'group'])['color'].value_counts()

The result of the groupby has an multiindex.

               value
id group color 

i1  aa   white  2
i1  ab   white  1
i1  ab   black  3
i1  ac   black  5
i1  ad   white  4
i1  ad   black  5

i2  aa   white  1
i2  aa   black  1
i2  bb   black  1
i2  cc   white  2
i2  cc   black  6
i2  ad   black  5  

My goal is to

  1. select entries where both categories of the last index color are present and then
  2. select the group with max value of the black color So the result would look like:
    value
id

i1  5    #only groups ab and ad have both colors; ad.black = 5 > ab.black = 3
i2  6    #only groups aa and cc have both colors; cc.black = 6 > aa.black = 1

I have tried .xs() as well as .index.get_level_values() but I am not able to achieve my goal.

EDIT 1: I see that I have provided poor information how I got the DataFrame and updated it above. I can't directly plugin .max() as the original df does not have value column.


Solution

  • Let's try:

    # mask the groups with more than one colors
    s = df.groupby(['id','group'])['value'].transform('size') > 1
    
    
    # boolean index the groups and query, then another groupby with max
    df[s].query('color=="black"').groupby(['id','color'])['value'].max()
    

    Output:

    id  color
    i1  black    5
    i2  black    6
    Name: value, dtype: int64