Tags: r, classification, logistic-regression, glm

R code: Error in model.matrix.default(mt, mf, contrasts) : Variable 1 has no levels


I am trying to build a logistic regression model with diagnosis as the response (a two-level factor: B, M). Fitting the model fails with this error:

Error in model.matrix.default(mt, mf, contrasts) : 
  variable 1 has no levels

I am not able to figure out how to solve this issue.

R Code:

Cancer <- read.csv("Breast_Cancer.csv")


## Logistic Regression Model

lm.fit <- glm(diagnosis~.-id-X, data = Cancer, family = binomial)
summary(lm.fit)

Dataset Reference: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data


Solution

  • Your problem is similar to the one reported here on the randomForest classifier.
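    Apparently glm builds its model frame from every variable the formula refers to, including X, even though the formula later subtracts it. The default na.action (na.omit) then drops every row that has a missing value. Because X contains only NA values, no rows are left, diagnosis ends up with no levels, and model.matrix throws the error.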
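A quick check along these lines can confirm the all-NA column before fitting. This is a minimal sketch on a toy data frame, not the Kaggle file: the column radius and all values are made up for illustration, and only id, diagnosis and X mirror names from the question.

```r
# Toy stand-in for Breast_Cancer.csv; `X` is entirely NA, as in the question.
# `radius` and all values here are hypothetical.
Cancer <- data.frame(
  id        = 1:6,
  diagnosis = factor(c("B", "M", "B", "M", "B", "M")),
  radius    = c(12.1, 17.9, 13.4, 20.2, 11.8, 18.5),
  X         = NA
)

# Names of the columns in which every value is missing
all_na <- names(Cancer)[colSums(is.na(Cancer)) == nrow(Cancer)]
print(all_na)

# Number of rows that survive na.omit, the default na.action
complete_rows <- nrow(na.omit(Cancer))
print(complete_rows)
```

If complete_rows is zero, the model has nothing left to fit, whatever the formula excludes.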

    You can fix that error in one of two ways:

    1. Drop X from your dataset by setting Cancer$X <- NULL before handing it to glm, and leave X out of your formula (glm(diagnosis~.-id, data = Cancer, family = binomial));
    2. Or add na.action = na.pass to the glm call, which keeps rows containing NA instead of dropping them, while still excluding X in the formula itself (glm(diagnosis~.-id-X, data = Cancer, family = binomial, na.action = na.pass)).
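The two options can be compared side by side on a small example. The sketch below uses a hand-made data frame rather than the Kaggle data: radius and its values are assumptions chosen so that the two classes overlap, and Cancer1 is just an illustrative name for the copy without X.

```r
# Hand-made toy data (NOT the Kaggle file); classes overlap on `radius`
Cancer <- data.frame(
  id        = 1:10,
  radius    = c(11.2, 12.5, 13.1, 13.8, 14.4, 15.0, 15.9, 16.7, 17.3, 18.6),
  diagnosis = factor(c("B", "B", "B", "M", "B", "M", "B", "M", "M", "M"),
                     levels = c("B", "M")),
  X         = NA   # all-NA column, as in the question
)

# Original call from the question, wrapped so the error is captured, not thrown
original <- tryCatch(
  glm(diagnosis ~ . - id - X, data = Cancer, family = binomial),
  error = function(e) e
)

# Fix 1: remove the column, then leave it out of the formula
Cancer1 <- Cancer
Cancer1$X <- NULL
fit1 <- glm(diagnosis ~ . - id, data = Cancer1, family = binomial)

# Fix 2: keep the column, but tell glm not to drop rows with missing values
fit2 <- glm(diagnosis ~ . - id - X, data = Cancer, family = binomial,
            na.action = na.pass)

print(coef(fit1))
print(coef(fit2))
```

Both fits use the same predictor, so their coefficients should agree. Fix 1 is probably the safer habit, since na.pass would also let genuine missing values in other columns through to the fit.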

    Note, however, that you still have to provide the diagnosis variable in a form digestible by glm: a numeric vector with values 0 and 1, a logical vector, or a factor.

    "For binomial and quasibinomial families the response can also be specified as a factor (when the first level denotes failure and all others success)" - from the glm-doc

    Just define Cancer$diagnosis <- as.factor(Cancer$diagnosis).
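Since R 4.0.0, read.csv no longer converts strings to factors by default, so diagnosis may arrive as a character vector. A minimal sketch of the conversion follows, using a made-up vector in place of Cancer$diagnosis. Setting the levels explicitly is optional here (as.factor sorts them alphabetically, which also puts B first), but it makes it visible that the model estimates the probability of M.

```r
# Hypothetical stand-in for Cancer$diagnosis as read in by R >= 4.0
diagnosis_raw <- c("M", "B", "B", "M", "B")

# First level ("B") is treated as failure, the remaining level ("M") as success
diagnosis <- factor(diagnosis_raw, levels = c("B", "M"))
print(levels(diagnosis))
print(table(diagnosis))

# Equivalent 0/1 coding, if a numeric response is preferred
diagnosis01 <- as.integer(diagnosis == "M")
print(diagnosis01)
```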

    On my end, this still leaves some warnings, but I think those are coming from the data or your feature selection. It clears the blocking errors :)