Search code examples
python-3.xpandasdataframegroup-by

How to select conditional rows and only return one result with groupby? (Python)


I want to get the only row after groupby with the certain rules(

  • get the tier with rank = 1 if tier != Unknown. (a001)
  • Or get the tier with rank =2 (b001)
  • return tier = 'no_tier' if rank = 1 and tier = Unknown (c001)

** "Unknown" would only show up once for the tier of each user_id

Here is the dataframe

user_id tier    rank
a001    High     1
a001    Low      2
a001    Unknown  3
b001    Unknown  1 
b001    Mid      2
c001    Unknown  1

expected result

tier    
High     
Mid      
no_tier  

I was trying to write down a function like

def get_tier(x):
    if x['rank'] ==1 and x['tier'] != 'Unknown':
        return x['tier']
    elif x['rank'] == 2:
        return x['tier']
    else:
        return 'no_tier'

df.goupby('user_id').apply(lambda x : x.apply(get_tier), axis = 1).iloc[0]).reset_index()


but seems like it cant work

Solution

  • Given your conditions and assuming that the ranks are already sorted in ascending order (per group or globally), you can just replace unknown with NA, get the first non-NA value with groupby.first, then fillna with :

    out = (df
           #.sort_values(by='rank')  # uncomment if not originally sorted
           ['tier']
           .replace('Unknown', pd.NA)
           .groupby(df['user_id']).first()
           .fillna('no_tier')
          )
    

    Output:

    user_id
    a001       High
    b001        Mid
    c001    no_tier
    Name: tier, dtype: object
    

    If you rather want the original rows, you can go with a groupby.idxmin to the the min rank per group after masking the Unknown ranks with large number (e.g. Inf):

    out = (df.loc[df['rank'].mask(df['tier'].eq('Unknown'), float('inf'))
                 .groupby(df['user_id']).idxmin()]
             .replace({'tier': {'Unknown': 'no_tier'}})
          )
    

    Output:

      user_id     tier  rank
    0    a001     High     1
    4    b001      Mid     2
    5    c001  no_tier     1