I am building a binary classifier as shown below. Can I replace BCELoss with a loss that optimizes the F1 score directly?
criterion = nn.BCELoss()
preds = model(inputs)
loss = criterion(preds, labels)
The F1 score is not a smooth function, so it cannot be optimized directly with gradient descent. As the network parameters change gradually, the output probability changes smoothly, but the F1 score changes only when a probability crosses the 0.5 decision threshold. As a result, the gradient of the F1 score is zero almost everywhere.
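To make the point concrete, here is a minimal sketch in plain Python. The helper `hard_f1` and the toy numbers are made up for illustration and are not part of the question's code. It shows that nudging the probabilities leaves the thresholded F1 unchanged until one of them crosses 0.5:

```python
def hard_f1(probs, labels, threshold=0.5):
    """Thresholded F1 (hypothetical helper; assumes at least one positive
    label or prediction so the denominator is non-zero)."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(p * t for p, t in zip(preds, labels))
    fp = sum(p * (1 - t) for p, t in zip(preds, labels))
    fn = sum((1 - p) * t for p, t in zip(preds, labels))
    return 2 * tp / (2 * tp + fp + fn)

labels = [1, 0, 1, 0]
probs = [0.9, 0.4, 0.3, 0.2]

# Small shift: every probability stays on the same side of 0.5.
nudged = [p + 0.05 for p in probs]
# The third probability now crosses the threshold.
crossed = [0.9, 0.4, 0.55, 0.2]

f1_before = hard_f1(probs, labels)
f1_nudged = hard_f1(nudged, labels)    # identical to f1_before
f1_crossed = hard_f1(crossed, labels)  # jumps to a higher value
```

Because `f1_before` and `f1_nudged` are identical, a small parameter update gives the optimizer no signal about which direction improves the score.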
You can use a soft version of the F-measure as described here. The trick is to replace the hard counts of true positives, false positives, and false negatives with probabilistic versions:

$$TP = \sum_i o_i t_i, \qquad FP = \sum_i o_i (1 - t_i), \qquad FN = \sum_i (1 - o_i)\, t_i$$

where $o_i$ is the network output and $t_i$ is the ground truth target probability. You then compute the F-measure as usual by plugging these definitions into the formula $F_1 = 2TP / (2TP + FP + FN)$.
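A soft F1 loss along these lines could be sketched in PyTorch as follows. The class name `SoftF1Loss`, the `eps` stabilizer, and the choice to return `1 - F1` (so that minimizing the loss maximizes the soft F1) are assumptions made for illustration, not something taken from the linked description. The sketch expects `preds` to be probabilities in [0, 1], as they would be for `nn.BCELoss`:

```python
import torch
import torch.nn as nn

class SoftF1Loss(nn.Module):
    """Hypothetical drop-in replacement for nn.BCELoss based on soft counts."""

    def __init__(self, eps=1e-7):
        super().__init__()
        self.eps = eps  # assumed stabilizer to avoid division by zero

    def forward(self, preds, labels):
        labels = labels.float()
        # Soft counts: products of probabilities instead of thresholded hits.
        tp = (preds * labels).sum()
        fp = (preds * (1 - labels)).sum()
        fn = ((1 - preds) * labels).sum()
        soft_f1 = 2 * tp / (2 * tp + fp + fn + self.eps)
        return 1 - soft_f1  # minimize 1 - F1 to maximize F1

criterion = SoftF1Loss()
# Toy batch standing in for model(inputs); values are made up.
preds = torch.tensor([0.9, 0.2, 0.6, 0.1], requires_grad=True)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = criterion(preds, labels)
loss.backward()  # gradients are non-zero, unlike with the thresholded F1
```

Note that the soft counts are summed over the batch, so the loss depends on batch composition; very small batches with few positives may give noisy estimates.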
Also, you might find this Kaggle tutorial useful.