I have a skewed dataset (5,000,000 positive examples and only 8000 negative [binary classified]) and thus, I know, accuracy is not a useful model evaluation metric. I know how to calculate precision and recall mathematically but I am unsure how to implement them in python code.
When I train the model on all the data I get 99% accuracy overall but 0% accuracy on the negative examples (ie. classifying everything as positive).
I have built my current model in Pytorch with the criterion = nn.CrossEntropyLoss()
and optimiser = optim.Adam()
.
So, my question is, how do I implement precision and recall into my training to produce the best model possible?
Thanks in advance
The implementation of precision, recall and F1 score and other metrics are usually imported from the scikit-learn library in python.
link: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
Regarding your classification task, the number of positive training samples simply eclipse the negative samples. Try training with reduced number of positive samples or generating more negative samples. I am not sure deep neural networks could provide you with an optimal result considering the class skewness.
Negative samples can be generated using the Synthetic Minority Over-sampling Technique (SMOT) technique. This link is a good place to start. Link: https://www.analyticsvidhya.com/blog/2017/03/imbalanced-classification-problem/
Try using simple models such as logistic regression or random forest first and check if there is any improvement in the F1 score of the model.