I was building a logistic regression model in R, but when I checked the coefficients with summary(model), the output showed NA in all four columns (Estimate, Std. Error, z value and Pr(>|z|)) for one of my independent variables. My other three variables worked fine.
I also checked for missing values, but there were none. I tried converting the variable between continuous and discrete using as.numeric and as.integer, but it still comes out as NA in the output. The variable itself measures the total volume of blood donated.
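Roughly what I tried looks like this (the data frame and variable names here are illustrative, not my real data):

# No missing values in the predictor
sum(is.na(donors$total_volume))   # returns 0

# Tried both encodings; summary(model) still shows NA for total_volume
donors$total_volume <- as.numeric(donors$total_volume)
model <- glm(donated ~ total_volume + recency + frequency + time,
             family = binomial, data = donors)
summary(model)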
I can't figure this out and it is bothering me. Thanks
Here is an example elaborating on the comment I made above; I'm using a simple linear model here, but the same principle applies to your logistic regression model.
Let's generate some data for the model y = x1 + x2 + epsilon, where the two predictor variables x1 and x2 are linearly dependent: x2 = 2.5 * x1.
# Generate sample data for y = x1 + x2 + noise,
# where x2 is an exact linear function of x1
set.seed(2017)
x1 <- seq(1, 100)
x2 <- 2.5 * x1
y <- x1 + x2 + rnorm(100)
We fit the model.
df <- cbind.data.frame(x1 = x1, x2 = x2, y = y)
fit <- lm(y ~ x1 + x2, df)
Now look at the parameter estimates.
summary(fit)
#
#Call:
#lm(formula = y ~ x1 + x2, data = df)
#
#Residuals:
#     Min       1Q   Median       3Q      Max
#-2.50288 -0.75360 -0.01388  0.67935  3.08515
#
#Coefficients: (1 not defined because of singularities)
#             Estimate Std. Error t value Pr(>|t|)
#(Intercept)  0.166567   0.215534   0.773    0.441
#x1           3.496831   0.003705 943.719   <2e-16 ***
#x2                 NA         NA      NA       NA
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 1.07 on 98 degrees of freedom
#Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
#F-statistic: 8.906e+05 on 1 and 98 DF, p-value: < 2.2e-16
You can see that the estimates for x2 are NA. This is a direct consequence of x1 and x2 being linearly dependent: R cannot estimate separate coefficients for the two, so lm drops the aliased term (hence the "1 not defined because of singularities" message). In other words, x2 is redundant, and the data can be described by the estimated linear model y = 3.4968 * x1 + epsilon; this agrees well with the theoretical model, since y = x1 + x2 + epsilon = x1 + 2.5 * x1 + epsilon = 3.5 * x1 + epsilon.
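If you want to confirm this directly on your own fit, a quick check along these lines should work (a sketch using the df and fit objects from above; alias() is in base R's stats package and should also work on a glm, since glm objects inherit from lm):

# Confirm that x2 is an exact linear function of x1
cor(df$x1, df$x2)   # exactly 1

# alias() reports linearly dependent (aliased) terms;
# here it shows x2 = 2.5 * x1
alias(fit)

# Refitting without the redundant predictor gives a full set of estimates
fit2 <- lm(y ~ x1, data = df)
summary(fit2)

For your data, I would check whether the total blood volume is an exact multiple of one of your other three predictors (for example, a fixed volume per donation times the number of donations); if so, it is aliased exactly as x2 is here, and you should drop one of the two.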