I'm relatively new to glm - so please bear with me.
I have created a glm (logistic regression) to predict whether an individual CONTINUES studies ("0") or does NOTCONTINUE ("1"). I am interested in predicting the latter. The glm uses seven factors in the dataset and the confusion matrices are very good for what I need and combining seven years' of data have also been done. Straight-forward.
However, I now need to apply the model to the current years' data, which of course does not have the NOTCONTINUE column in it. Lets say the glm model is "CombinedYears" and the new data is "Data2020"
How can I use the glm model to get predictions of who will ("0") or will NOT ("1") continue their studies? Do I need to insert a NOTCONTINUE column into the latest file ?? I have tried this structure
Predict2020 <- predict(CombinedYears, data.frame(Data2020), type = 'response')
but the output only holds values <0.5.
Any help very gratefully appreciated. Thank you in advance
You mentioned that you already created a prediction model to predict whether a particular student will continue studies or not. You used the glm package and your model name is CombinedYears.
Now, what you have to know is that your problem is a binary classification and you used logistic regression for this. The output of your model when you apply it on new data, or even the same data used to fit the model, is probabilities. These are values between zero and one. In the development phase of your model, you need to determine the cutoff threshold of these probabilities which you can use later on when you predict new data. For example, you may determine 0.5 as a cutoff, and every probability above that is considered NOTCONTINUE and below that is CONTINUE. However, the best threshold can be determined from your data as well by maximizing both specificity and sensitivity. This can be done by calculating the area under the receiver operating characteristic curve (AUC). There are many packages than can do this for you, such as pROC and AUC packages in R. The same packages can determine the best cutoff as well.
What you have to do is the following:
Determine the cutoff threshold after calculating the AUC
library(pROC) roc_object = roc(your_fit_data$NOTCONTINUE ~ fitted(CombinedYears)) coords(roc.roc_object, "best", ret="threshold", transpose = FALSE)
Use your model to predict on your new data year (as you did)
Predict2020 = predict(CombinedYears, data.frame(Data2020), type = 'response')
Now, the content of Predict2020 is just probabilities for each student. Use the cutoff you obtained from step (1) to classify your students accordingly