I have a dataset which includes socioeconomic indicators for students nationwide as well as their grades. More specifically, this dataset has 36 variables with about 30 million students as predictors and then the students grades as the responses.
My goal is to be able to predict whether a student will fail out (ie. be in the bottom 2%ile of the nation in terms of grades). I understand that classification with an imbalanced dataset (98% : 2%) will introduce a bias. Based on some research I planned to account for this by increasing the cost of an incorrect classification in the minority class.
Can someone please confirm that this is the correct approach (and that there isn't a better one, I'm assuming there is)? And also, given the nature of this dataset, could someone please help me choose a machine learning algorithm to accomplish this?
I am working with TensorFlow 2.0 in a Google Colab. I've compiled all the data together into a .feather file using pandas.
In case of having imbalanced dataset, using weighted class is the most common approach to do so, but having such large dataset (30M training example) for binary classification problem representing 2% for the first class and 98% for the second one, I can say it's too hard to prevent model to be unbiased against first class using weighted class as it's not too much differ from reducing the training set size to be balanced.
split your dataset set to train, evalution and test sets.
For evaluation metric I suggest these alternatives.
a. Make sure to have at least +20%, representing the first class for both evaluation and test sets.
b. Set evalution metric to be precision and recall for your model accuracy (rather than using f1 score).
c. Set evalution metric to be Cohen's kapp score (coefficient).
From my own perspective, I prefer using b.
Since you are using tensorflow, I assume that you are familiar with deep learning. so use deep learning instead of machine learning, that's gives you the ability to have many additional alternatives, anyway, here some steps for both machine learning and deep learning approach.
For Machine Leaning Algorithms
For Deep Leaning Algorithms
You have many other alternatives, From my own perspective, I may try hard to get it with 1, 3 or 5.
For Deep learning 5th approach sometimes works very well and I recommend to try it with 1, 3.