Tags: python, machine-learning, pytorch, precision-recall

Using Precision and Recall when training on a skewed dataset


I have a skewed dataset (5,000,000 positive examples and only 8,000 negative [binary classification]), so I know accuracy is not a useful evaluation metric. I know how to calculate precision and recall mathematically, but I am unsure how to implement them in Python.

When I train the model on all the data I get 99% accuracy overall but 0% accuracy on the negative examples (i.e. it classifies everything as positive).

I have built my current model in PyTorch with criterion = nn.CrossEntropyLoss() and optimiser = optim.Adam().

So, my question is: how do I incorporate precision and recall into my training to produce the best model possible?

Thanks in advance


Solution

  • Implementations of precision, recall, F1 score and other metrics are usually imported from the scikit-learn library in Python.

    link: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
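
    For example, a minimal sketch using those scikit-learn functions (the tensors below are toy stand-ins for your model's logits and your labels, with 1 = positive and 0 = negative):

    ```python
    import torch
    from sklearn.metrics import precision_score, recall_score, f1_score

    # Toy stand-ins: `logits` would come from model(inputs), `labels` from your dataset
    logits = torch.tensor([[0.2, 2.1], [1.5, 0.3], [0.1, 1.9], [2.2, 0.4]])
    labels = torch.tensor([1, 0, 1, 1])

    preds = logits.argmax(dim=1).cpu().numpy()  # predicted class per sample
    y_true = labels.cpu().numpy()

    # With 5,000,000 positives vs 8,000 negatives, report the minority (negative)
    # class explicitly by setting pos_label=0
    print("precision:", precision_score(y_true, preds, pos_label=0))
    print("recall:   ", recall_score(y_true, preds, pos_label=0))
    print("f1:       ", f1_score(y_true, preds, pos_label=0))
    ```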

    Regarding your classification task, the number of positive training samples simply eclipses the negative samples. Try training with a reduced number of positive samples (undersampling) or generating more negative samples (oversampling). I am not sure a deep neural network will give you an optimal result given this degree of class skew.
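
    A rough sketch of the undersampling idea, assuming your features and labels are NumPy arrays (the toy X and y below are placeholders):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy stand-ins for your feature matrix and labels (1 = positive, 0 = negative)
    X = rng.normal(size=(1000, 5))
    y = np.concatenate([np.ones(990, dtype=int), np.zeros(10, dtype=int)])

    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)

    # Keep every negative, randomly drop positives down to e.g. 10x the negative count
    keep_pos = rng.choice(pos_idx, size=10 * len(neg_idx), replace=False)
    keep = np.concatenate([keep_pos, neg_idx])
    rng.shuffle(keep)

    X_balanced, y_balanced = X[keep], y[keep]
    print(np.bincount(y_balanced))  # [ 10 100] -> 10 negatives, 100 positives
    ```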

    Negative samples can be generated with the Synthetic Minority Over-sampling Technique (SMOTE). This link is a good place to start: https://www.analyticsvidhya.com/blog/2017/03/imbalanced-classification-problem/
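
    In code, the imbalanced-learn package provides SMOTE; a minimal sketch (assuming a reasonably recent version of imbalanced-learn, with toy data standing in for your features and labels):

    ```python
    import numpy as np
    from imblearn.over_sampling import SMOTE

    rng = np.random.default_rng(0)

    # Toy stand-ins: 990 positives (label 1) vs 10 negatives (label 0)
    X = rng.normal(size=(1000, 5))
    y = np.concatenate([np.ones(990, dtype=int), np.zeros(10, dtype=int)])

    # SMOTE synthesises new minority-class samples by interpolating between
    # existing minority samples and their nearest neighbours
    X_res, y_res = SMOTE(random_state=0, k_neighbors=5).fit_resample(X, y)

    print(np.bincount(y))      # [ 10 990] before
    print(np.bincount(y_res))  # [990 990] after oversampling the minority class
    ```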

    Try simpler models such as logistic regression or a random forest first and check whether the F1 score improves.
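
    A baseline along those lines might look like the sketch below; the synthetic dataset and class_weight="balanced" are assumptions added to mimic the skew and counteract it:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # Synthetic, heavily imbalanced data standing in for your real dataset
    X, y = make_classification(n_samples=20000, n_features=20,
                               weights=[0.002, 0.998], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    for clf in (LogisticRegression(max_iter=1000, class_weight="balanced"),
                RandomForestClassifier(n_estimators=200, class_weight="balanced")):
        clf.fit(X_tr, y_tr)
        # classification_report shows precision/recall/F1 for each class separately
        print(type(clf).__name__)
        print(classification_report(y_te, clf.predict(X_te), digits=3))
    ```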