Search code examples
pythonpandaspandas-groupby

Pandas pick the higher value for each unique id


I have a df of customers

CUST_ID | SEGMENT | AREA
  1     |  B      | CAD
  1     |  A      | RAM
  2     |  B      | CAD
  2     |  C      | RAM
  3     |  B      | RAM
  4     |  A      | RAM

I want to count the unique number of CUST_ID per SEGMENT so I did

df.groupby(['SEGMENT'])['CUST_ID'].nunique()

However if there are same CUST_ID with different SEGMENT types then the number per SEGMENT gets inflated. I want to pick the highest value SEGMENT per CUST_ID and then count. A being the highest and C being the lowest. So the resulting df would look like:

CUST_ID | SEGMENT | AREA
  1     |  A      | RAM
  2     |  B      | CAD
  3     |  B      | RAM
  4     |  A      | RAM

and the count would be

  • A - 2
  • B - 2
  • C - 0

How would I be able to do this?


Solution

  • You can try groupby CUST_ID column then filter rows by getting the min value of SEGMENT column.

    out = (df.groupby(['CUST_ID'])
           .apply(lambda g: g[g['SEGMENT'].eq(g['SEGMENT'].min())])
           .reset_index(drop=True))
    

    NOTE: Since you want to pick the highest value SEGMENT per CUST_ID and then count, A being the highest and C being the lowest, in ASCII talbe, A is 65, C is 67. When comparing, A actually is smaller than C. That's why use min here.

    print(out)
    
       CUST_ID SEGMENT AREA
    0        1       A  RAM
    1        2       B  CAD
    2        3       B  RAM
    3        4       A  RAM
    
    res = out.value_counts('SEGMENT')
    
    print(res)
    
    A    2
    B    2
    Name: SEGMENT, dtype: int64