I am trying to build a logistic regression model with diagnosis as the response (a two-level factor: B, M). Building the model fails with this error:
Error in model.matrix.default(mt, mf, contrasts) :
variable 1 has no levels
I am not able to figure out how to solve this issue.
R Code:
Cancer <- read.csv("Breast_Cancer.csv")
## Logistic Regression Model
lm.fit <- glm(diagnosis~.-id-X, data = Cancer, family = binomial)
summary(lm.fit)
Dataset Reference: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
Your problem is similar to the one reported here for the randomForest classifier.
Apparently glm checks through the variables in your data and throws an error because X contains only NA values.
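To confirm that diagnosis on your own data, you can look for columns that contain nothing but NA. This is a minimal sketch that uses a small made-up data frame (toy, with invented values) in place of the Kaggle CSV; only the column names id, diagnosis and X come from the question.

```r
# Toy stand-in for the CSV from the question (all values are made up)
toy <- data.frame(
  id = 1:4,
  diagnosis = c("B", "M", "B", "M"),
  radius_mean = c(12.1, 20.3, 11.4, 18.9),
  X = NA  # mimics the column that holds only NA
)

# A column is "all NA" when it has zero non-missing entries
all_na <- names(toy)[colSums(!is.na(toy)) == 0]
print(all_na)
```

Running the same two lines on Cancer instead of toy should list X if it is indeed the culprit.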
You can fix that error in one of two ways.
Either remove X from your dataset completely by setting Cancer$X <- NULL before handing it to glm, and leave X out of your formula (glm(diagnosis~.-id, data = Cancer, family = binomial)).
Or add na.action = na.pass to the glm call (which essentially tells glm to pass NA values through instead of acting on them) while still excluding X in the formula itself (glm(diagnosis~.-id-X, data = Cancer, family = binomial, na.action = na.pass)).
However, please note that you still have to provide the diagnosis variable in a form glm can digest: either a numeric vector with values 0 and 1, a logical vector, or a factor.
"For binomial and quasibinomial families the response can also be specified as a factor (when the first level denotes failure and all others success)" - from the glm documentation.
Just define Cancer$diagnosis <- as.factor(Cancer$diagnosis).
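Putting the pieces together, here is a hedged sketch of both fixes on simulated data. The data frame toy, the predictor radius_mean and the simulated relationship are assumptions made for illustration; they are not the real Kaggle data, so the fitted numbers mean nothing beyond showing that both calls run.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the CSV: one predictor plus an all-NA column X
toy <- data.frame(
  id = seq_len(n),
  radius_mean = rnorm(n, mean = 14, sd = 3),
  X = NA
)
# Made-up rule: larger radius_mean makes "M" more likely
p_m <- plogis(0.5 * (toy$radius_mean - 14))
toy$diagnosis <- ifelse(runif(n) < p_m, "M", "B")

# The response must be a factor (or 0/1 numeric, or logical) for binomial glm
toy$diagnosis <- as.factor(toy$diagnosis)

# Fix 1: keep X in the data, exclude it in the formula, and pass NAs through
fit_pass <- glm(diagnosis ~ . - id - X, data = toy,
                family = binomial, na.action = na.pass)

# Fix 2: drop X from the data before fitting
toy_clean <- toy
toy_clean$X <- NULL
fit_drop <- glm(diagnosis ~ . - id, data = toy_clean, family = binomial)

print(levels(toy$diagnosis))
print(coef(fit_drop))
```

Because as.factor orders the levels alphabetically, B becomes the first level (failure) and M the success, so the model estimates the probability of M. Both fits use the same rows and the same predictors, so their coefficients should agree.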
On my end, this still leaves some warnings, but I think those are coming from the data or your feature selection. It clears the blocking errors :)