Search code examples
pythonpandasdataframepandas-groupbyrank

How can I rank based on condition in Pandas


Supposed, I have Pandas DataFrame looks like below:

Cluster Variable Group Ratio Value
1 GDP_M3 GDP 20% 70%
1 HPI_M6 HPI 40% 80%
1 GDP_lg2 GDP 35% 50%
2 CPI_M9 CPI 10% 50%
2 HPI_lg6 HPI 15% 65%
3 CPI_lg12 CPI 15% 90%
3 CPI_lg1 CPI 20% 95%

I would like to rank Variable based on Ratio and Value in the separated columns. The Ratio will rank from the lowest to the highest, while the Value will rank from the highest to the lowest.

There are some variables that I do not want to rank. In the example, I do not prefer CPI. Any type of CPI will not be considered for the rank e.g., CPI_M9. However, the case will be expected only if there is only that particular variable in the Cluster.

The results from condition above will look like the table below:

Cluster Variable Group Ratio Value RankRatio RankValue
1 GDP_M3 GDP 20% 70% 1 2
1 HPI_M6 HPI 40% 80% 3 1
1 GDP_lg2 GDP 35% 50% 2 3
2 CPI_M9 CPI 10% 50% NaN NaN
2 HPI_lg6 HPI 15% 65% 1 1
3 CPI_lg12 CPI 15% 90% 1 2
3 CPI_lg1 CPI 20% 95% 2 1

For Cluster 1, the GDP_M3 has the lowest Ratio at 20%, while the HPI_M3 has the highest Value at 80%. Thus, both of them will be assigned rank 1 and the others will be followed subsequently.

For Cluster 2, even CPI_M9 has the lowest Ratio but the CPI is not prefer. Thus, the rank 1 will be assigned to HPI_lg6.

For Cluster 3, there are variables from the only CPI Group and there is no other options to rank. Thus, the CPI_lg12 and CPI_lg1 are ranked based on the lowest Ratio and the highest Value.

df['RankRatio'] = df.groupby(['Cluster'])['Ratio'].rank(method = 'first', ascending = True)
df['RankValue'] = df.groupby(['Cluster'])['Value'].rank(method = 'first', ascending = False)

I have some code that can be handled only general case but for specific case with unprefer group of variables, my code cannot handle it.

Please help or suggest on this. Thank you.


Solution

  • Use:

    #convert columns to numeric
    df[['Ratio','Value']]=df[['Ratio','Value']].apply(lambda x: x.str.strip('%')).astype(float)
    

    Remove row with CPI by condition - test rows if no only CPI per Cluster:

    m = df['Group'].eq('CPI')
    m1 = ~df['Cluster'].isin(df.loc[m, 'Cluster']) | m
    df['RankRatio'] = df[m1].groupby('Cluster')['Ratio'].rank(method='first', ascending=True)
    df['RankValue'] = df[m1].groupby('Cluster')['Value'].rank(method='first', ascending=False)
    
    
    print (df)
       Cluster  Variable Group  Ratio  Value  RankRatio  RankValue
    0        1    GDP_M3   GDP   20.0   70.0        1.0        2.0
    1        1    HPI_M6   HPI   40.0   80.0        3.0        1.0
    2        1   GDP_lg2   GDP   35.0   50.0        2.0        3.0
    3        2    CPI_M9   CPI   10.0   50.0        NaN        NaN
    4        2   HPI_lg6   HPI   15.0   65.0        1.0        1.0
    5        3  CPI_lg12   CPI   15.0   90.0        1.0        2.0
    6        3   CPI_lg1   CPI   20.0   95.0        2.0        1.0
    

    How it working:

    For mask2 are filter all Cluster values if match mask1 and filtered original column Cluster, then invert mask by ~. Last chain both conditions by | for bitwise OR for all rows without CPI if exist with another values per Cluster:

    print (df.assign(mask1 = m, mask2 = ~df['Cluster'].isin(df.loc[m, 'Cluster']), both = m1))
       Cluster  Variable Group  Ratio  Value  mask1  mask2   both
    0        1    GDP_M3   GDP   20.0   70.0  False   True   True
    1        1    HPI_M6   HPI   40.0   80.0  False   True   True
    2        1   GDP_lg2   GDP   35.0   50.0  False   True   True
    3        2    CPI_M9   CPI   10.0   50.0   True  False   True
    4        2   HPI_lg6   HPI   15.0   65.0  False  False  False
    5        3  CPI_lg12   CPI   15.0   90.0   True  False   True
    6        3   CPI_lg1   CPI   20.0   95.0   True  False   True
    

    EDIT:

    df[['Ratio','Value']]=df[['Ratio','Value']].apply(lambda x: x.str.strip('%')).astype(float)
    
    m = df['Group'].isin(['CPI','HPI'])
    m2 = df.groupby('Cluster')['Group'].transform('nunique').ne(1)
    m1 = (~df['Cluster'].isin(df.loc[~m, 'Cluster']) | m) & m2
    df['RankRatio'] = df[~m1].groupby('Cluster')['Ratio'].rank(method='first', ascending=True)
    df['RankValue'] = df[~m1].groupby('Cluster')['Value'].rank(method='first', ascending=False)
    print (df)
       Cluster  Variable Group  Ratio  Value  RankRatio  RankValue
    0        1    GDP_M3   GDP   20.0   70.0        1.0        1.0
    1        1    HPI_M6   HPI   40.0   80.0        NaN        NaN
    2        1   GDP_lg2   GDP   35.0   50.0        2.0        2.0
    3        2    CPI_M9   CPI   10.0   50.0        NaN        NaN
    4        2   HPI_lg6   HPI   15.0   65.0        NaN        NaN
    5        3  CPI_lg12   CPI   15.0   90.0        1.0        2.0
    6        3   CPI_lg1   CPI   20.0   95.0        2.0        1.0
    

    print (df.assign(mask1 = m, mask2 = ~df['Cluster'].isin(df.loc[~m, 'Cluster']), m2=m2, all = ~m1))
       Cluster  Variable Group  Ratio  Value  RankRatio  RankValue  mask1  mask2  \
    0        1    GDP_M3   GDP   20.0   70.0        1.0        1.0  False  False   
    1        1    HPI_M6   HPI   40.0   80.0        NaN        NaN   True  False   
    2        1   GDP_lg2   GDP   35.0   50.0        2.0        2.0  False  False   
    3        2    CPI_M9   CPI   10.0   50.0        NaN        NaN   True   True   
    4        2   HPI_lg6   HPI   15.0   65.0        NaN        NaN   True   True   
    5        3  CPI_lg12   CPI   15.0   90.0        1.0        2.0   True   True   
    6        3   CPI_lg1   CPI   20.0   95.0        2.0        1.0   True   True   
    
          m2    all  
    0   True   True  
    1   True  False  
    2   True   True  
    3   True  False  
    4   True  False  
    5  False   True  
    6  False   True