Search code examples
pythonpython-3.xpandasnumpymax

Getting idxmax values in pandas groupby in case of equal values or ties


I am trying to identify the most probable event (transportation) of customers. This is an example df:

df = pd.DataFrame({'customer id':[1,1,1,2,2,2,2,2,2,2,3,3],
                   'Trans':['Car','Car','Bus','Bus','Bus','Bus','Car','Car','Car','Plane','Car','Bus']})

gives:

    customer id  Trans
0             1    Car
1             1    Car
2             1    Bus
3             2    Bus
4             2    Bus
5             2    Bus
6             2    Car
7             2    Car
8             2    Car
9             2  Plane
10            3    Car
11            3    Bus

I did the following steps to get the largest probable values for each customer id:

#Get the count & percent of each customer id transportation
df2 = df.groupby(['customer id','Trans'])['Trans'].size().reset_index(name='Trans Counts')
df2['percent'] = df2.groupby("customer id")['Trans Counts'].transform(lambda x: (x / x.sum()).round(2))

#Get records > 1
df3 = df2[df2['Trans Counts'] > 1]

So I got this df:

  customer id Trans  Trans Counts  percent
1            1   Car             2     0.67
2            2   Bus             3     0.43
3            2   Car             3     0.43

There is a tie for customer id 2, so when I use idxmax() :

df3.loc[df3.groupby('customer id')['Trans Counts'].idxmax()]

It only shows the first row:

   customer id Trans  Trans Counts  percent
1            1   Car             2     0.67
2            2   Bus             3     0.43

How can I get the top 2 records within the same group in case of a tie? I also used try nlargest(2) but I got the results for the whole df not within the grouping & if I used it within the aggregate function, it does not show the expected output!

Expected output in case of a tie:

   customer id Trans            Trans Counts  percent
1            1   Car                      2     0.67
2            2   **Bus, Car**             3     0.43

Thanks


Solution

  • This is how you can assign the new column based on the maximum transportation type

    In [144]: df2 = pd.get_dummies(df).groupby("customer id").sum()
    #             Trans_Bus  Trans_Car  Trans_Plane
    #customer id                                   
    #1                    1          2            0
    #2                    3          3            1
    #3                    1          1            0
    
    
    In [145]: df2['max_cols'] = [ ",".join([col for col in df2 if df2.lo
         ...: c[i,col]==mx]) for i,mx in df2.max(axis=1).items()]
    
    In [146]: df2
    Out[146]: 
                 Trans_Bus  Trans_Car  Trans_Plane             max_cols
    customer id                                                        
    1                    1          2            0            Trans_Car
    2                    3          3            1  Trans_Bus,Trans_Car
    3                    1          1            0  Trans_Bus,Trans_Car
    

    More or less the same approach, but adding percentage and max count as columns:

    In [4]: df2 = pd.get_dummies(df).groupby("customer id").sum()
    
    In [5]: trans_count = df2.max(axis=1)
    
    In [6]: trans_perc = trans_count / df2.sum(axis=1)
    
    In [7]: trans_type = [ ",".join([col for col in df2 if df2.loc[i,col]=
       ...: =mx]) for i,mx in trans_count.items()]
    
    In [8]: pd.DataFrame({"count":trans_count, "perc":trans_perc, "kind":t
       ...: rans_type}, index=df2.index)
    Out[8]: 
                 count      perc                 kind
    customer id                                      
    1                2  0.666667            Trans_Car
    2                3  0.428571  Trans_Bus,Trans_Car
    3                1  0.500000  Trans_Bus,Trans_Car
    
    

    Note, you can call reset_index() on the resulting dataframe to move customer id from index to column