Search code examples
pythonpandasdataframedrop

Can one drop rows in a dataframe based on nunique values?


I want to ignore the rows is the occupation has less than 2 unique names:

name        value      occupation
   a           23      mechanic
   a           24      mechanic
   b           30      mechanic
   c           40      mechanic
   c           41      mechanic
   d           30      doctor
   d           20      doctor
   e           70      plumber
   e           71      plumber
   f           30      plumber
   g           50      tailor

I did:

df.groupby('ocuupation')['name'].nunique()
>>>>>>
occupation
mechanic   3
doctor     1
plumber    2
tailor     1
Name: name, dtype: int64

Is it possible to use something like df = df.drop(df[<some boolean condition>].index)?

Desired output:

name        value      occupation
   a           23      mechanic
   a           24      mechanic
   b           30      mechanic
   c           40      mechanic
   c           41      mechanic
   e           70      plumber
   e           71      plumber
   f           30      plumber

Solution

  • Use GroupBy.transform with Series.ge for get values equal of greater like 2:

    df = df[df.groupby('occupation')['name'].transform('nunique').ge(2)]
    print (df)
      name  value occupation
    0    a     23   mechanic
    1    a     24   mechanic
    2    b     30   mechanic
    3    c     40   mechanic
    4    c     41   mechanic
    7    e     70    plumber
    8    e     71    plumber
    9    f     30    plumber
    

    Your solution is filtered values of index in Series compared in Series.isin:

    s = df.groupby('occupation')['name'].nunique()
    
    df = df[df['occupation'].isin(s[s.ge(2)].index)]
    print (df)
      name  value occupation
    0    a     23   mechanic
    1    a     24   mechanic
    2    b     30   mechanic
    3    c     40   mechanic
    4    c     41   mechanic
    7    e     70    plumber
    8    e     71    plumber
    9    f     30    plumber