Tags: machine-learning, xgboost, lightgbm, ranking-models

XGBoost / XGBRanker to produce probabilities instead of ranking scores


I have a dataset of the performance of students in exams, which looks like this:

Class_ID   Class_size   Student_Number   IQ   Hours_Studied   Score
1          3            3                101  10              98
1          3            4                99   19              80
1          3            6                130  3               95
2          4            4                93   5               50
2          4            5                103  9               88
2          4            8                112  12              99
2          4            1                200  10              100 

and I would like to build a machine learning model that predicts who is going to be top of the class (i.e. have the highest Score) for any given Class_ID, using IQ and Hours_Studied as features.

Since this is a ranking problem, a natural choice of model is the XGBRanker in XGBoost or the LGBMRanker in LightGBM.

Here is my code using xgboost:

from sklearn.model_selection import GroupShuffleSplit
import xgboost as xgb

# df is the DataFrame shown above; split at the class level so that
# no Class_ID appears in both train and test
gss = GroupShuffleSplit(test_size=.40, n_splits=1, random_state=7).split(df, groups=df['Class_ID'])

X_train_inds, X_test_inds = next(gss)

train_data = df.iloc[X_train_inds]
X_train = train_data.loc[:, ~train_data.columns.isin(['Class_ID','Student_Number','Score'])]
y_train = train_data.loc[:, train_data.columns.isin(['Score'])]

# group sizes for the ranker, one entry per Class_ID in the training set
# (this assumes the rows of df are sorted by Class_ID, as in the sample above)
groups = train_data.groupby('Class_ID').size().to_numpy()

test_data = df.iloc[X_test_inds]

# Class_ID is kept here so the predictions can be grouped per class below
X_test = test_data.loc[:, ~test_data.columns.isin(['Student_Number','Score'])]
y_test = test_data.loc[:, test_data.columns.isin(['Score'])]

model = xgb.XGBRanker(
    tree_method='hist',
    device='cuda',
    booster='gbtree',
    objective='rank:pairwise',
    enable_categorical=True,
    random_state=42,
    learning_rate=0.1,  # note: eta is an alias of learning_rate, so only one is set
    colsample_bytree=0.9,
    max_depth=6,
    n_estimators=175,
    subsample=0.75
    )

model.fit(X_train, y_train, group=groups, verbose=True)

def predict(model, df):
    # drop the id/grouping columns so only the features reach the model
    return model.predict(df.loc[:, ~df.columns.isin(['Class_ID','Student_Number'])])

# one array of relevance scores per class
predictions = (X_test.groupby('Class_ID')
                     .apply(lambda x: predict(model, x)))

The code works fine, with reasonable predictive power. However, the output is a list of relevance scores rather than a list of probabilities, and it seems that neither XGBRanker nor LGBMRanker has a predict_proba method that returns the probability of getting the highest score in the class.

So my question is: is there any way to convert the relevance scores into probabilities, or are there any other natural classes of ranking models that deal with this kind of problem?

Edit: In this problem I only care about who ends up top of the class (or maybe the top 3), so the full ranking isn't all that important (for example, knowing that student 4 ranks 11th and student 8 ranks 12th does not matter much). One option is therefore to use classification instead of ranking in xgboost, as in the sketch below, but I wonder whether there is another way.
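
For reference, here is a rough sketch of what I mean by the classification route, assuming a hypothetical Is_top column derived from Score (the column name and hyperparameters are illustrative only):

import xgboost as xgb

# hypothetical binary target: 1 for the top scorer in each class, 0 otherwise
df['Is_top'] = (df.groupby('Class_ID')['Score']
                  .rank(ascending=False, method='min')
                  .eq(1)
                  .astype(int))

features = ['IQ', 'Hours_Studied']

clf = xgb.XGBClassifier(objective='binary:logistic', tree_method='hist',
                        n_estimators=175, max_depth=6, learning_rate=0.1,
                        random_state=42)
clf.fit(df.iloc[X_train_inds][features], df.iloc[X_train_inds]['Is_top'])

# predict_proba gives a per-student probability of being top of the class,
# but these are independent per-row estimates and do not sum to 1 within a class
proba = clf.predict_proba(df.iloc[X_test_inds][features])[:, 1]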


Solution

  • If you do not care about the score itself, but do care about who is the best in class, you should use a ranker with binary classes as the target. Create a column "Is_first", set it to 1 for the best student in each class and to 0 for the rest, and then fit the ranker (not a classifier).

    In this case the result is going to be a value between 0 and 1, and the model will consider all the students in a class when predicting who is the best. The result values are not guaranteed to sum to 1 across the students in a class, but you can write some code to normalize them if they do not (see the sketch below).
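
    A rough sketch of this idea, reusing the split from the question (the Is_first construction, the feature list, and the softmax normalization are my own illustrative choices, not anything built into XGBRanker):

    import numpy as np
    import xgboost as xgb

    # binary relevance: 1 for the best student in each class, 0 for the rest
    df['Is_first'] = (df.groupby('Class_ID')['Score']
                        .rank(ascending=False, method='min')
                        .eq(1)
                        .astype(int))

    features = ['IQ', 'Hours_Studied']
    train_data = df.iloc[X_train_inds]
    groups = train_data.groupby('Class_ID').size().to_numpy()

    ranker = xgb.XGBRanker(objective='rank:pairwise', tree_method='hist',
                           n_estimators=175, max_depth=6, learning_rate=0.1,
                           random_state=42)
    ranker.fit(train_data[features], train_data['Is_first'], group=groups)

    def class_probabilities(scores):
        # softmax over one class's raw scores: positive values that sum to 1
        # within the class (a heuristic normalization, not a calibrated probability)
        e = np.exp(scores - scores.max())
        return e / e.sum()

    test_data = df.iloc[X_test_inds]
    probs = (test_data.groupby('Class_ID')
                      .apply(lambda g: class_probabilities(ranker.predict(g[features]))))

    The softmax makes the per-class values comparable and sum to 1, but they are not calibrated probabilities; if you need calibrated estimates, the classification route from the question's edit is the more direct option.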