Compare and rank rows in dataframe based on two columns?

I am trying to figure out how to compare and rank multiple rows in a pandas dataframe based on two conditions.

These are the conditions:

rule1 < rule2 

if support(rule1) <= support(rule2) and confidence(rule1) < confidence(rule2) 

or support(rule1) < support(rule2) and confidence(rule1) <= confidence(rule2)

    
rule1 = rule2 

if support(rule1) = support(rule2) and confidence(rule1) = confidence(rule2)

This is how my dataframe is setup:

import pandas as pd

data = {
'rules': [(4444, 5555), (8747, 1254), (7414, 1214), (5655, 6651), (4454, 3321), (4893, 4923), (1271, 8330), (9112, 4722), (4511, 6722), (1102, 5789), (2340, 5720), (9822, 5067)],
'support': [0.0048, 0.00141, 0.0085, 0.00106, 0.00106, 0.00038, 0.00179, 0.00913, 0.00221, 0.00173, 0.00098, 0.00024],
'confidence': [0.873015, 0.533333, 0.593220, 0.012060, 0.012060, 0.237699, 0.453423, 0.097672, 0.116983, 0.541221, 0.743222, 0.378219]
}

df = pd.DataFrame(data=data, index=data['rules']).drop(columns=['rules'])

   
  (Index)
   Rules       Support     Confidence
(4444, 5555)   0.0048      0.873015
(8747, 1254)   0.00141     0.533333
(7414, 1214)   0.0085      0.593220
(5655, 6651)   0.00106     0.012060
(4454, 3321)   0.00106     0.012060
(4893, 4923)   0.00038     0.237699
(1271, 8330)   0.00179     0.453423
(9112, 4722)   0.00913     0.097672
(4511, 6722)   0.00221     0.116983
(1102, 5789)   0.00173     0.541221
(2340, 5720)   0.00098     0.743222
(9822, 5067)   0.00024     0.378219

This how my desired dataframe will look (not sure exactly what the rank will be.... this is hypothetical ranking)

   (Index)
    Rules      Support     Confidence    Rank
(7414, 1214)   0.0085      0.593220        1
(4444, 5555)   0.0048      0.873015        2
(5655, 6651)   0.00106     0.012060        3
(4454, 3321)   0.00106     0.012060        3
(8747, 1254)   0.00141     0.533333        4
(1271, 8330)   0.00179     0.453423        5
(1102, 5789)   0.00173     0.541221        6
(2340, 5720)   0.00098     0.743222        7
(9822, 5067)   0.00024     0.378219        8
(9112, 4722)   0.00913     0.097672        9
(4511, 6722)   0.00221     0.116983        10
(4893, 4923)   0.00038     0.237699        11

I've got some idea on how to get this code working but I'm not sure how to do the comparison of every rule against every rule. I want the best rule according to the conditions to float to the top. Its not a large dataframe (< 1000) so I don't really care about speed just accuracy.

This is the code I got so far:

def rank_rules(confidence, support):

    # IF / ELSE goes here
   
    df['rank'] = some_var.rank(method='max')
  
    df.sort_values(by=['rank'], ascending=False)

    return df


df = df.apply(lambda x: rank_rules(x['confidence'], x['support']), axis=1)

Solution

Solution: A suggested approach

If I understand you correctly, you are trying to create a ranking system based on multiple columns (support, confidence). You could think of these two as two orthogonal axes (x, y) on a scatter-plot. And in the absence of further sorting-logic, I will assume that euclidean-distance is what we could use here to sort the rows to create ranks.

Process the Data

I have shown here that using a MinMaxScaler could be an option (in addition to optionally using zscore).

Code

import numpy as np
import pandas as pd
import matplotlib.pylab as plt
from sklearn.preprocessing import MinMaxScaler

%matplotlib inline 
%config InlineBackend.figure_format = 'svg' # 'svg', 'retina' 
plt.style.use('seaborn-white')

df = df.reset_index(drop=False).rename(columns={'index': 'rules'})
df['distance'] = (df.support**2 + df.confidence**2)**0.5
df['zsupport'] = (df.support - df.support.mean())/df.support.std()
df['zconfidence'] = (df.confidence - df.confidence.mean())/df.confidence.std()
df['zdistance'] = (df.zsupport**2 + df.zconfidence**2)**0.5

round_strategy = {
    'support': 5,
    'confidence': 6,
    'distance': 5,
}

scaler = MinMaxScaler()
df2 = pd.DataFrame(scaler.fit_transform(df[['zsupport', 'zconfidence']]), 
                   columns=['scaled_support', 'scaled_confidence'])
df = pd.concat([df, df2], ignore_index=False, axis=1)
df['scaled_distance'] = (df.scaled_support**2 + df.scaled_confidence**2)**0.5
df = df.sort_values(['scaled_distance'], ascending=False).reset_index(drop=True)
df['Rank'] = df.index

decimals = dict()
for col in df.columns:
    for key, value in round_strategy.items():
        if key in col:
            decimals.update({col: value})
df = df.round(decimals=decimals)

sizes = (df.shape[0] - df.Rank)/df.shape[0]
colors = round(255*sizes).astype(int)
df

Plot

import plotly.express as px

fig = px.scatter(df4, x="scaled_support", y="scaled_confidence", text="Rank", 
                  log_x=False, size_max=20, 
                  color="Rank", 
                  size=(np.arange(df4.index.size) + 4)[::-1], 
                  hover_data=df4.columns)
fig.update_traces(textposition='top center')
fig.update_layout(title_text='Support vs. Confidence with Rank', title_x=0.5)
fig.show()

Dummy Data

import pandas as pd

data = {
'rules': [(4444, 5555), (8747, 1254), (7414, 1214), (5655, 6651), (4454, 3321), (4893, 4923), (1271, 8330), (9112, 4722), (4511, 6722), (1102, 5789), (2340, 5720), (9822, 5067)],
'support': [0.0048, 0.00141, 0.0085, 0.00106, 0.00106, 0.00038, 0.00179, 0.00913, 0.00221, 0.00173, 0.00098, 0.00024],
'confidence': [0.873015, 0.533333, 0.593220, 0.012060, 0.012060, 0.237699, 0.453423, 0.097672, 0.116983, 0.541221, 0.743222, 0.378219]
}

df = pd.DataFrame(data=data, index=data['rules']).drop(columns=['rules'])