r, regression, prediction, glm

GLM regression prediction: understanding which factor level is success


I have built a binomial glm model. The model predicts one of two potential classes: AD or Control. The response is a factor with levels {AD, Control}. I use this model to predict and obtain a probability for each sample, but it is unclear to me whether a probability over 0.5 indicates AD or Control.

Here is my dataset:

> head(example)
          cleaned_mayo$Diagnosis pca_results$x[, 1]
1052_TCX                      AD          0.9613241
1104_TCX                      AD         -0.9327390
742_TCX                       AD          1.6908874
1945_TCX                 Control          0.6819104
134_TCX                       AD          0.5184748
11386_TCX                Control          0.4669661

And here is my code to compute the model and make predictions:

# Randomize rows of top performer
example<- example[sample(nrow(example)),]

# Subset data for training and testing
N_train<- round(nrow(example)*0.75)
train<- example[1:N_train,]
test<- example[(N_train+1):nrow(example),]
colnames(train)[1:2]<- c("Diagnosis", "Eigen_gene")
colnames(test)[1:2]<- c("Diagnosis", "Eigen_gene")

# Build model and predict   
model_IFGyel<- glm(Diagnosis ~ Eigen_gene, data = train, family = binomial())
pred<- predict(model_IFGyel, newdata= test, type= "response")

# Convert predicted probabilities to class labels, then compute accuracy
pred_class<- ifelse(pred<0.5, "AD", "Control")
test$Diagnosis<- as.character(test$Diagnosis)
example_acc<- sum(test$Diagnosis==pred_class, na.rm = TRUE)/nrow(test)

Any help clarifying what these prediction probabilities indicate is appreciated.


Solution

  • From ?glm we note:

    Details:

    A typical predictor has the form ‘response ~ terms’ where ‘response’ is the (numeric) response vector and ‘terms’ is a series of terms which specifies a linear predictor for ‘response’. For ‘binomial’ and ‘quasibinomial’ families the response can also be specified as a ‘factor’ (when the first level denotes failure and all others success) or as a two-column matrix with the columns giving the numbers of successes and failures.

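    The key part is the parenthetical: for a factor response, the first level denotes failure and all other levels denote success. Assuming you didn't set the levels yourself (i.e. R's default alphabetical ordering applied), AD is the failure level and Control is the success level. Hence the coefficients and the values returned by predict(..., type = "response") are in terms of the probability that an observation is in the Control class, so a predicted value above 0.5 points to Control.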

    If you want to change this, rebuild the response with factor(..., levels = c('Control', 'AD')) before fitting, or just take 1 - P(Control), i.e. 1 - predicted value, to get the probability of AD.
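As a rough illustration of the points above, here is a minimal sketch on simulated data. The data frame `toy`, the column `Diagnosis_AD`, and the simulated values are made up for this example and are not the asker's dataset. It checks which level comes first, then shows that releveling the response (here via `relevel()`, one alternative to `factor(..., levels = ...)`) gives probabilities that should equal 1 minus the original predictions, up to numerical tolerance.

```r
# Toy data, invented for illustration only (not the asker's dataset)
set.seed(1)
toy <- data.frame(
  Diagnosis  = factor(rep(c("AD", "Control"), each = 50)),
  Eigen_gene = c(rnorm(50, mean = -1), rnorm(50, mean = 1))
)

# Default levels are sorted alphabetically, so "AD" comes first (failure)
# and "Control" is treated as success
levels(toy$Diagnosis)

# Fit with the default level order: predictions are P(Control)
fit_control <- glm(Diagnosis ~ Eigen_gene, data = toy, family = binomial())
p_control   <- predict(fit_control, type = "response")

# Make "Control" the first level so that "AD" becomes the success class
toy$Diagnosis_AD <- relevel(toy$Diagnosis, ref = "Control")
fit_ad <- glm(Diagnosis_AD ~ Eigen_gene, data = toy, family = binomial())
p_ad   <- predict(fit_ad, type = "response")   # now P(AD)

# Both fits describe the same model: P(AD) should match 1 - P(Control)
all.equal(unname(p_ad), unname(1 - p_control))

# Sanity check: mean predicted P(Control) within each true class
tapply(p_control, toy$Diagnosis, mean)
```

Running `levels()` on the response before fitting is the quickest way to confirm which class the predicted probabilities refer to; the second level is the one being modelled as success.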