Search code examples
pythonpandasanalyticsranking

Compare and rank rows in dataframe based on two columns?


I am trying to figure out how to compare and rank multiple rows in a pandas dataframe based on two conditions.

These are the conditions:

rule1 < rule2 

if support(rule1) <= support(rule2) and confidence(rule1) < confidence(rule2) 

or support(rule1) < support(rule2) and confidence(rule1) <= confidence(rule2)

    
rule1 = rule2 

if support(rule1) = support(rule2) and confidence(rule1) = confidence(rule2)

This is how my dataframe is setup:

import pandas as pd

data = {
'rules': [(4444, 5555), (8747, 1254), (7414, 1214), (5655, 6651), (4454, 3321), (4893, 4923), (1271, 8330), (9112, 4722), (4511, 6722), (1102, 5789), (2340, 5720), (9822, 5067)],
'support': [0.0048, 0.00141, 0.0085, 0.00106, 0.00106, 0.00038, 0.00179, 0.00913, 0.00221, 0.00173, 0.00098, 0.00024],
'confidence': [0.873015, 0.533333, 0.593220, 0.012060, 0.012060, 0.237699, 0.453423, 0.097672, 0.116983, 0.541221, 0.743222, 0.378219]
}

df = pd.DataFrame(data=data, index=data['rules']).drop(columns=['rules'])

   
  (Index)
   Rules       Support     Confidence
(4444, 5555)   0.0048      0.873015
(8747, 1254)   0.00141     0.533333
(7414, 1214)   0.0085      0.593220
(5655, 6651)   0.00106     0.012060
(4454, 3321)   0.00106     0.012060
(4893, 4923)   0.00038     0.237699
(1271, 8330)   0.00179     0.453423
(9112, 4722)   0.00913     0.097672
(4511, 6722)   0.00221     0.116983
(1102, 5789)   0.00173     0.541221
(2340, 5720)   0.00098     0.743222
(9822, 5067)   0.00024     0.378219

This how my desired dataframe will look (not sure exactly what the rank will be.... this is hypothetical ranking)

   (Index)
    Rules      Support     Confidence    Rank
(7414, 1214)   0.0085      0.593220        1
(4444, 5555)   0.0048      0.873015        2
(5655, 6651)   0.00106     0.012060        3
(4454, 3321)   0.00106     0.012060        3
(8747, 1254)   0.00141     0.533333        4
(1271, 8330)   0.00179     0.453423        5
(1102, 5789)   0.00173     0.541221        6
(2340, 5720)   0.00098     0.743222        7
(9822, 5067)   0.00024     0.378219        8
(9112, 4722)   0.00913     0.097672        9
(4511, 6722)   0.00221     0.116983        10
(4893, 4923)   0.00038     0.237699        11

I've got some idea on how to get this code working but I'm not sure how to do the comparison of every rule against every rule. I want the best rule according to the conditions to float to the top. Its not a large dataframe (< 1000) so I don't really care about speed just accuracy.

This is the code I got so far:

def rank_rules(confidence, support):

    # IF / ELSE goes here
   
    df['rank'] = some_var.rank(method='max')
  
    df.sort_values(by=['rank'], ascending=False)

    return df


df = df.apply(lambda x: rank_rules(x['confidence'], x['support']), axis=1)
 

Solution

  • Solution: A suggested approach

    If I understand you correctly, you are trying to create a ranking system based on multiple columns (support, confidence). You could think of these two as two orthogonal axes (x, y) on a scatter-plot. And in the absence of further sorting-logic, I will assume that euclidean-distance is what we could use here to sort the rows to create ranks.

    Process the Data

    I have shown here that using a MinMaxScaler could be an option (in addition to optionally using zscore).

    Code

    import numpy as np
    import pandas as pd
    import matplotlib.pylab as plt
    from sklearn.preprocessing import MinMaxScaler
    
    %matplotlib inline 
    %config InlineBackend.figure_format = 'svg' # 'svg', 'retina' 
    plt.style.use('seaborn-white')
    
    df = df.reset_index(drop=False).rename(columns={'index': 'rules'})
    df['distance'] = (df.support**2 + df.confidence**2)**0.5
    df['zsupport'] = (df.support - df.support.mean())/df.support.std()
    df['zconfidence'] = (df.confidence - df.confidence.mean())/df.confidence.std()
    df['zdistance'] = (df.zsupport**2 + df.zconfidence**2)**0.5
    
    round_strategy = {
        'support': 5,
        'confidence': 6,
        'distance': 5,
    }
    
    scaler = MinMaxScaler()
    df2 = pd.DataFrame(scaler.fit_transform(df[['zsupport', 'zconfidence']]), 
                       columns=['scaled_support', 'scaled_confidence'])
    df = pd.concat([df, df2], ignore_index=False, axis=1)
    df['scaled_distance'] = (df.scaled_support**2 + df.scaled_confidence**2)**0.5
    df = df.sort_values(['scaled_distance'], ascending=False).reset_index(drop=True)
    df['Rank'] = df.index
    
    decimals = dict()
    for col in df.columns:
        for key, value in round_strategy.items():
            if key in col:
                decimals.update({col: value})
    df = df.round(decimals=decimals)
    
    sizes = (df.shape[0] - df.Rank)/df.shape[0]
    colors = round(255*sizes).astype(int)
    df
    

    enter image description here

    Plot

    import plotly.express as px
    
    fig = px.scatter(df4, x="scaled_support", y="scaled_confidence", text="Rank", 
                      log_x=False, size_max=20, 
                      color="Rank", 
                      size=(np.arange(df4.index.size) + 4)[::-1], 
                      hover_data=df4.columns)
    fig.update_traces(textposition='top center')
    fig.update_layout(title_text='Support vs. Confidence with Rank', title_x=0.5)
    fig.show()
    

    enter image description here

    Dummy Data

    import pandas as pd
    
    data = {
    'rules': [(4444, 5555), (8747, 1254), (7414, 1214), (5655, 6651), (4454, 3321), (4893, 4923), (1271, 8330), (9112, 4722), (4511, 6722), (1102, 5789), (2340, 5720), (9822, 5067)],
    'support': [0.0048, 0.00141, 0.0085, 0.00106, 0.00106, 0.00038, 0.00179, 0.00913, 0.00221, 0.00173, 0.00098, 0.00024],
    'confidence': [0.873015, 0.533333, 0.593220, 0.012060, 0.012060, 0.237699, 0.453423, 0.097672, 0.116983, 0.541221, 0.743222, 0.378219]
    }
    
    df = pd.DataFrame(data=data, index=data['rules']).drop(columns=['rules'])