Search code examples
pythonpandasdataframeunique

How to create new column with nuniqe values per ID in other column and do not count defined value in Python Pandas?


I have DataFrame in Python Pandas like below (ID and COL1 is "object" data type):

ID COL1
123 XX
123 XX
123 L
456 AP
456 AP
77 C
88 LACK
66 LACK
66 G

And I need to create new column "COL1_bis" where will be infromation how many unique values in "COL1" has each ID. But if some ID has value "LACK" in "COL1" then we do not count that

So as a result I need something like below:

ID COL1 COL1_bis
123 XX 2
123 XX 2
123 L 2
456 AP 1
456 AP 1
77 C 1
88 LACK 0
66 LACK 1
66 GG 1

Because:

  • ID = 123 has 2 in COL1_bis, because this ID has 2 different (unique) values in COL1: "XX" and "L"
  • ID = 456 has 1 in COL1_bis, because this ID has 1 unique value in COL1: "AP"
  • ID = 77 has 1 in COL1_bis, because this ID has 1 unique value in COL1: "C"
  • ID = 88 has 0 in COL1_bis, because this ID has value "LACK" in COL1, so we do not count this value
  • ID = 66 has 1 in COL1_bis, because this ID has "LACK" in COL1 and "GG" so we count only "GG"

How can I do that in Python Pandas?


Solution

  • Use Series.mask for replace LACK to missing values and then use GroupBy.transform with DataFrameGroupBy.nunique:

    df['COL1_bis'] = (df['COL1'].mask(df['COL1'].eq('LACK'))
                                .groupby(df['ID'])
                                .transform('nunique'))
    print (df)
        ID  COL1  COL1_bis
    0  123    XX         2
    1  123    XX         2
    2  123     L         2
    3  456    AP         1
    4  456    AP         1
    5   77     C         1
    6   88  LACK         0
    7   66  LACK         1
    8   66     G         1