I'm training a CatBoostClassifier on an imbalanced dataset (binary classification), optimizing Logloss with F1 as the evaluation metric. The resulting plot shows different scores for F1: use_weights = True and F1: use_weights = False, and both differ from the F1 scores I get when I predict on the training and validation sets myself.
from catboost import CatBoostClassifier

params = {
    'iterations': 500,
    'learning_rate': 0.2,
    'eval_metric': 'F1',
    'loss_function': 'Logloss',
    'custom_metric': ['F1', 'Precision', 'Recall'],
    'scale_pos_weight': 19,   # up-weight the positive class to handle the imbalance
    'use_best_model': True,
    'max_depth': 8
}

modelcat = CatBoostClassifier(**params)
modelcat.fit(
    train_pool,
    eval_set=validation_pool,
    verbose=False,
    plot=True
)
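For reference, train_pool and validation_pool are catboost Pool objects built from the same data I predict on later. A minimal sketch of how they might be constructed (cat_feature_cols is a placeholder for my list of categorical column names):

from catboost import Pool

# Pools wrap the feature matrices and labels; cat_feature_cols is assumed to
# hold the names (or indices) of the categorical columns in the data
train_pool = Pool(trainX_cat0, label=y_train, cat_features=cat_feature_cols)
validation_pool = Pool(valX_cat0, label=y_val, cat_features=cat_feature_cols)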
When I predict on the validation and training sets and compute the F1 score with sklearn's f1_score, I get these scores:
from sklearn.metrics import f1_score

ypredcat0 = modelcat.predict(valX_cat0)   # validation predictions
print(f"F1: {f1_score(y_val, ypredcat0)}")
# F1: 0.4163473818646233

ytrainpredcat0 = modelcat.predict(trainX_cat0)   # training predictions
print(f"F1: {f1_score(y_train, ytrainpredcat0)}")
# F1: 0.42536905412793874
But when I look at the plot created by plot=True, I see different convergence scores:

[plot of the F1 curves with use_weights = False]
[plot of the F1 curves with use_weights = True]
In the plots, the training F1 clearly reaches 1, but when I make predictions it is only 0.42. Why are these different? And how does use_weights work here?
Okay, I figured out an answer. The difference lies in how the F1 score is averaged. By default, for binary classification scikit-learn uses average='binary', which gives the F1 of 0.42 reported above. When I changed it to average='macro', the F1 score was 0.67, which is what CatBoost shows with use_weights = False. With average='micro' the F1 score was 0.88, even higher than what the plot shows, but either way that answers both of my questions.
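To make the comparison concrete, here is the same f1_score call on the validation predictions with the three averaging modes (the values in the comments are the ones reported above, rounded):

from sklearn.metrics import f1_score

# Same validation labels/predictions as above; only the averaging mode changes
print(f1_score(y_val, ypredcat0, average='binary'))  # ~0.42: F1 of the positive class only (sklearn's default)
print(f1_score(y_val, ypredcat0, average='macro'))   # ~0.67: unweighted mean of the per-class F1 scores
print(f1_score(y_val, ypredcat0, average='micro'))   # ~0.88: computed from the global TP/FP/FN counts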