I have a dataset of students' performance in exams, which looks like this:
Class_ID  Class_size  Student_Number  IQ   Hours_Studied  Score
1         3           3               101  10             98
1         3           4               99   19             80
1         3           6               130  3              95
2         4           4               93   5              50
2         4           5               103  9              88
2         4           8               112  12             99
2         4           1               200  10             100
and I would like to build a machine learning model that predicts who is going to be top of the class (i.e. have the highest Score) for any given Class_ID, using IQ and Hours_Studied as features. Since this is a ranking problem, a natural class of models to use is XGBRanker in XGBoost or LGBMRanker in lightgbm.
Here is my code using xgboost:
from sklearn.model_selection import GroupShuffleSplit
import xgboost as xgb

# Split by Class_ID so that no class is shared between train and test
gss = GroupShuffleSplit(test_size=0.40, n_splits=1, random_state=7).split(df, groups=df['Class_ID'])
train_inds, test_inds = next(gss)

train_data = df.iloc[train_inds]
X_train = train_data.loc[:, ~train_data.columns.isin(['Class_ID', 'Student_Number', 'Score'])]
y_train = train_data['Score']

# Group sizes for the ranker; rows must be contiguous per Class_ID for these to line up
groups = train_data.groupby('Class_ID').size().to_numpy()

test_data = df.iloc[test_inds]
# Keep Class_ID in X_test so predictions can be grouped by class later
X_test = test_data.loc[:, ~test_data.columns.isin(['Student_Number', 'Score'])]
y_test = test_data['Score']

model = xgb.XGBRanker(
    tree_method='hist',
    device='cuda',
    booster='gbtree',
    objective='rank:pairwise',
    enable_categorical=True,
    random_state=42,
    learning_rate=0.1,  # eta is an alias of learning_rate, so only one is set
    colsample_bytree=0.9,
    max_depth=6,
    n_estimators=175,
    subsample=0.75
)
model.fit(X_train, y_train, group=groups, verbose=True)

def predict(model, df):
    # Drop the id/grouping columns before scoring
    return model.predict(df.loc[:, ~df.columns.isin(['Class_ID', 'Student_Number'])])

# One array of relevance scores per class
predictions = (X_test.groupby('Class_ID')
                     .apply(lambda x: predict(model, x)))
The code works fine with reasonable predictive power. However, the output is a list of "relevance scores" rather than a list of probabilities, and it seems that neither XGBRanker nor LGBMRanker has a predict_proba attribute that returns the probability of getting the highest score in the class. So my question is: is there any way to convert the relevance scores into probabilities, or are there other natural classes of ranking models that deal with this kind of problem?
Edit: In this problem I only care about the person who ends up top of the class (or maybe the top 3), so the full ranking isn't all that important (for example, knowing that student 4 ranks 11th and student 8 ranks 12th does not matter much). I suppose one way is to use classification instead of ranking in xgboost, but I wonder whether there is another way.
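A rough sketch of what I mean by the classification alternative, assuming a derived binary column Is_top (not part of the original data):

import xgboost as xgb

# Hypothetical target: 1 for the top scorer in each class, 0 otherwise
df['Is_top'] = (df['Score'] == df.groupby('Class_ID')['Score'].transform('max')).astype(int)

clf = xgb.XGBClassifier(objective='binary:logistic', random_state=42)
clf.fit(df[['IQ', 'Hours_Studied']], df['Is_top'])

# predict_proba gives per-student P(Is_top = 1); note these are
# not normalized within a class
top_proba = clf.predict_proba(df[['IQ', 'Hours_Studied']])[:, 1]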
If you do not care about the score itself, but do care about who is the best in the class, you should use a ranker with binary classes as the target. Create a column "Is_first", set it to 1 for the best student in each class and to 0 for the rest, then fit the ranker (not a classifier).
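A minimal sketch of that setup, reusing your df and assuming the rows are sorted by Class_ID so the group sizes line up:

# Binary relevance: 1 for the top scorer in each class, 0 for everyone else
df['Is_first'] = (df['Score'] == df.groupby('Class_ID')['Score'].transform('max')).astype(int)

ranker = xgb.XGBRanker(objective='rank:pairwise', random_state=42)
ranker.fit(df[['IQ', 'Hours_Studied']],
           df['Is_first'],
           group=df.groupby('Class_ID').size().to_numpy())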
In this case the result is going to be a value between 0 and 1, and the model will consider all the students in a class when predicting who is the best. I am not sure whether the resulting values will sum to 1 across the students in a class, but you can write some code to normalize them if they do not.
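For the normalization, one option (my suggestion, not something built into XGBRanker) is a per-class softmax over the raw scores; the outputs then sum to 1 within each class, though they are not calibrated probabilities:

import numpy as np

def scores_to_proba(scores):
    # Softmax within one class; subtract the max for numerical stability
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

# Reusing the predict() helper and X_test from the question's code
proba_per_class = (X_test.groupby('Class_ID')
                         .apply(lambda g: scores_to_proba(predict(model, g))))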