I'm training a CatBoostClassifier on an imbalanced dataset (binary classification), optimizing Logloss with F1 as the evaluation metric. The resulting plot shows different scores for F1: use_weights = True and F1: use_weights = False, and both differ from the F1 scores I get when I predict on the training and validation sets myself.
from catboost import CatBoostClassifier

params = {
    'iterations': 500,
    'learning_rate': 0.2,
    'eval_metric': 'F1',
    'loss_function': 'Logloss',
    'custom_metric': ['F1', 'Precision', 'Recall'],
    'scale_pos_weight': 19,   # up-weight the positive class to handle the imbalance
    'use_best_model': True,
    'max_depth': 8
}

modelcat = CatBoostClassifier(**params)
modelcat.fit(
    train_pool,
    eval_set=validation_pool,
    verbose=False,
    plot=True
)
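For reference, train_pool and validation_pool are catboost Pool objects built from the same data I predict on later. A minimal sketch of how they might be constructed (cat_feature_cols is a placeholder for my list of categorical column names):

from catboost import Pool

# Pools wrap the feature matrices and labels; cat_feature_cols is assumed to
# hold the names (or indices) of the categorical columns in the data
train_pool = Pool(trainX_cat0, label=y_train, cat_features=cat_feature_cols)
validation_pool = Pool(valX_cat0, label=y_val, cat_features=cat_feature_cols)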
When I predict on the validation and training sets and compute the F1 score with sklearn's f1_score, I get these scores:
from sklearn.metrics import f1_score

ypredcat0 = modelcat.predict(valX_cat0)   # validation predictions
print(f"F1: {f1_score(y_val, ypredcat0)}")
# F1: 0.4163473818646233

ytrainpredcat0 = modelcat.predict(trainX_cat0)   # training predictions
print(f"F1: {f1_score(y_train, ytrainpredcat0)}")
# F1: 0.42536905412793874
But when I look at the plot created by plot=True, I see different convergence scores:

[plot of the F1 curves with use_weights = False]
[plot of the F1 curves with use_weights = True]
In the plots, the training F1 clearly reaches 1, but when I make predictions it is only 0.42. Why are these different? And how does use_weights work here?
Okay, I figured out an answer. The difference lies in how the F1 score is averaged. By default, for binary classification scikit-learn uses average='binary', which gives the F1 of 0.42 reported above. When I changed it to average='macro', the F1 score was 0.67, which is what CatBoost shows with use_weights = False. With average='micro' the F1 score was 0.88, even higher than what the plot shows, but either way that answers both of my questions.
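To make the comparison concrete, here is the same f1_score call on the validation predictions with the three averaging modes (the values in the comments are the ones reported above, rounded):

from sklearn.metrics import f1_score

# Same validation labels/predictions as above; only the averaging mode changes
print(f1_score(y_val, ypredcat0, average='binary'))  # ~0.42: F1 of the positive class only (sklearn's default)
print(f1_score(y_val, ypredcat0, average='macro'))   # ~0.67: unweighted mean of the per-class F1 scores
print(f1_score(y_val, ypredcat0, average='micro'))   # ~0.88: computed from the global TP/FP/FN counts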