I am trying to figure out how to compare and rank multiple rows in a pandas dataframe based on two conditions.
These are the conditions:
rule1 < rule2
if support(rule1) <= support(rule2) and confidence(rule1) < confidence(rule2)
or support(rule1) < support(rule2) and confidence(rule1) <= confidence(rule2)
rule1 = rule2
if support(rule1) = support(rule2) and confidence(rule1) = confidence(rule2)
This is how my dataframe is setup:
import pandas as pd
data = {
'rules': [(4444, 5555), (8747, 1254), (7414, 1214), (5655, 6651), (4454, 3321), (4893, 4923), (1271, 8330), (9112, 4722), (4511, 6722), (1102, 5789), (2340, 5720), (9822, 5067)],
'support': [0.0048, 0.00141, 0.0085, 0.00106, 0.00106, 0.00038, 0.00179, 0.00913, 0.00221, 0.00173, 0.00098, 0.00024],
'confidence': [0.873015, 0.533333, 0.593220, 0.012060, 0.012060, 0.237699, 0.453423, 0.097672, 0.116983, 0.541221, 0.743222, 0.378219]
}
df = pd.DataFrame(data=data, index=data['rules']).drop(columns=['rules'])
(Index)
Rules Support Confidence
(4444, 5555) 0.0048 0.873015
(8747, 1254) 0.00141 0.533333
(7414, 1214) 0.0085 0.593220
(5655, 6651) 0.00106 0.012060
(4454, 3321) 0.00106 0.012060
(4893, 4923) 0.00038 0.237699
(1271, 8330) 0.00179 0.453423
(9112, 4722) 0.00913 0.097672
(4511, 6722) 0.00221 0.116983
(1102, 5789) 0.00173 0.541221
(2340, 5720) 0.00098 0.743222
(9822, 5067) 0.00024 0.378219
This how my desired dataframe will look (not sure exactly what the rank will be.... this is hypothetical ranking)
(Index)
Rules Support Confidence Rank
(7414, 1214) 0.0085 0.593220 1
(4444, 5555) 0.0048 0.873015 2
(5655, 6651) 0.00106 0.012060 3
(4454, 3321) 0.00106 0.012060 3
(8747, 1254) 0.00141 0.533333 4
(1271, 8330) 0.00179 0.453423 5
(1102, 5789) 0.00173 0.541221 6
(2340, 5720) 0.00098 0.743222 7
(9822, 5067) 0.00024 0.378219 8
(9112, 4722) 0.00913 0.097672 9
(4511, 6722) 0.00221 0.116983 10
(4893, 4923) 0.00038 0.237699 11
I've got some idea on how to get this code working but I'm not sure how to do the comparison of every rule against every rule. I want the best rule according to the conditions to float to the top. Its not a large dataframe (< 1000) so I don't really care about speed just accuracy.
This is the code I got so far:
def rank_rules(confidence, support):
# IF / ELSE goes here
df['rank'] = some_var.rank(method='max')
df.sort_values(by=['rank'], ascending=False)
return df
df = df.apply(lambda x: rank_rules(x['confidence'], x['support']), axis=1)
If I understand you correctly, you are trying to create a ranking system based on multiple columns (support, confidence). You could think of these two as two orthogonal axes (x
, y
) on a scatter-plot. And in the absence of further sorting-logic, I will assume that euclidean-distance is what we could use here to sort the rows to create ranks.
I have shown here that using a MinMaxScaler
could be an option (in addition to optionally using zscore
).
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
from sklearn.preprocessing import MinMaxScaler
%matplotlib inline
%config InlineBackend.figure_format = 'svg' # 'svg', 'retina'
plt.style.use('seaborn-white')
df = df.reset_index(drop=False).rename(columns={'index': 'rules'})
df['distance'] = (df.support**2 + df.confidence**2)**0.5
df['zsupport'] = (df.support - df.support.mean())/df.support.std()
df['zconfidence'] = (df.confidence - df.confidence.mean())/df.confidence.std()
df['zdistance'] = (df.zsupport**2 + df.zconfidence**2)**0.5
round_strategy = {
'support': 5,
'confidence': 6,
'distance': 5,
}
scaler = MinMaxScaler()
df2 = pd.DataFrame(scaler.fit_transform(df[['zsupport', 'zconfidence']]),
columns=['scaled_support', 'scaled_confidence'])
df = pd.concat([df, df2], ignore_index=False, axis=1)
df['scaled_distance'] = (df.scaled_support**2 + df.scaled_confidence**2)**0.5
df = df.sort_values(['scaled_distance'], ascending=False).reset_index(drop=True)
df['Rank'] = df.index
decimals = dict()
for col in df.columns:
for key, value in round_strategy.items():
if key in col:
decimals.update({col: value})
df = df.round(decimals=decimals)
sizes = (df.shape[0] - df.Rank)/df.shape[0]
colors = round(255*sizes).astype(int)
df
import plotly.express as px
fig = px.scatter(df4, x="scaled_support", y="scaled_confidence", text="Rank",
log_x=False, size_max=20,
color="Rank",
size=(np.arange(df4.index.size) + 4)[::-1],
hover_data=df4.columns)
fig.update_traces(textposition='top center')
fig.update_layout(title_text='Support vs. Confidence with Rank', title_x=0.5)
fig.show()
import pandas as pd
data = {
'rules': [(4444, 5555), (8747, 1254), (7414, 1214), (5655, 6651), (4454, 3321), (4893, 4923), (1271, 8330), (9112, 4722), (4511, 6722), (1102, 5789), (2340, 5720), (9822, 5067)],
'support': [0.0048, 0.00141, 0.0085, 0.00106, 0.00106, 0.00038, 0.00179, 0.00913, 0.00221, 0.00173, 0.00098, 0.00024],
'confidence': [0.873015, 0.533333, 0.593220, 0.012060, 0.012060, 0.237699, 0.453423, 0.097672, 0.116983, 0.541221, 0.743222, 0.378219]
}
df = pd.DataFrame(data=data, index=data['rules']).drop(columns=['rules'])