I have built a binomial glm model. The model predicts one of two classes: AD or Control. The response variable is a factor with levels {AD, Control}. I use this model to predict and obtain a probability for each sample, but it is unclear to me whether a probability over 0.5 indicates AD or Control.
Here is my dataset:
> head(example)
cleaned_mayo$Diagnosis pca_results$x[, 1]
1052_TCX AD 0.9613241
1104_TCX AD -0.9327390
742_TCX AD 1.6908874
1945_TCX Control 0.6819104
134_TCX AD 0.5184748
11386_TCX Control 0.4669661
And here is my code to compute the model and make predictions:
# Randomize rows of top performer
example<- example[sample(nrow(example)),]
# Subset data for training and testing
N_train<- round(nrow(example)*0.75)
train<- example[1:N_train,]
test<- example[(N_train+1):nrow(example),]
colnames(train)[1:2]<- c("Diagnosis", "Eigen_gene")
colnames(test)[1:2]<- c("Diagnosis", "Eigen_gene")
# Build model and predict
model_IFGyel<- glm(Diagnosis ~ Eigen_gene, data = train, family = binomial())
pred<- predict(model_IFGyel, newdata= test, type= "response")
# Convert predictions to accuracy metric
pred<- ifelse(pred<0.5, "AD", "Control")
test$Diagnosis<- as.character(test$Diagnosis)
example_acc<- sum(test$Diagnosis==pred, na.rm = TRUE)/nrow(test)
Any help clarifying what these prediction probabilities indicate is appreciated.
From ?glm we note:
Details:
A typical predictor has the form ‘response ~ terms’ where ‘response’ is the (numeric) response vector and ‘terms’ is a series of terms which specifies a linear predictor for ‘response’. For ‘binomial’ and ‘quasibinomial’ families the response can also be specified as a ‘factor’ (when the first level denotes failure and all others success) or as a two-column matrix with the columns giving the numbers of successes and failures.
The key part is the parenthetical about factor responses: the first level denotes failure and all others success. Assuming you didn't specify the levels yourself (i.e. R's default alphabetical ordering applied), AD is the failure level and Control is the success level. Hence the coefficients and the predicted values are in terms of the probability that an observation is in the Control class.
If you want to change this, use factor(...., levels = c('Control', 'AD')) before fitting, or just take 1 - P(Control), i.e. 1 - predicted value, to get the probability of AD.
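To check this on your own data, inspect levels() of the response and compare the two parameterisations. The following is a minimal sketch on simulated data, not the poster's dataset: the data frame dat, the simulated Eigen_gene values and the object names fit_control and fit_ad are all made up for illustration. It uses relevel() as one way of putting Control first, which is equivalent to the factor(...., levels = c('Control', 'AD')) call above.

```r
# Simulated stand-in for the real data (hypothetical values)
set.seed(1)
n <- 200
eigen_gene <- rnorm(n)
# Make Control more likely as eigen_gene increases
p_control <- plogis(1.5 * eigen_gene)
diagnosis <- factor(ifelse(runif(n) < p_control, "Control", "AD"))

# Default level order is alphabetical, so AD is the first ("failure") level
levels(diagnosis)  # "AD" "Control"

dat <- data.frame(Diagnosis = diagnosis, Eigen_gene = eigen_gene)

# Fitted values here are P(Control), the second level
fit_control <- glm(Diagnosis ~ Eigen_gene, data = dat, family = binomial())
p1 <- predict(fit_control, type = "response")

# Put Control first so that AD becomes the "success" level
dat$Diagnosis_AD <- relevel(dat$Diagnosis, ref = "Control")
fit_ad <- glm(Diagnosis_AD ~ Eigen_gene, data = dat, family = binomial())
p2 <- predict(fit_ad, type = "response")  # now P(AD)

# The two sets of probabilities should be complements of each other
all.equal(unname(p1), unname(1 - p2))
```

In this simulation Control becomes more likely as Eigen_gene increases, so a positive Eigen_gene coefficient in fit_control is another sign that the model is describing P(Control).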