I have a pretty simple logistic regression model based solely on two categorical predictors, Race and Sex. Firstly, since I have some missing values, I import the data frame as follows so that all the missing data comes in as NA:
> mydata <- read.csv("~/Desktop/R/mydata.csv", sep=",", strip.white = TRUE,
+ na.strings= c("999", "NA", " ", ""))
Here's the summary of the predictors to see how many NAs there are:
> # Define variables
>
> Y <- cbind(Support)
> X <- cbind(Race, Sex)
>
> summary(X)
Race Sex
Min. :1.000000 Min. :1.000000
1st Qu.:1.000000 1st Qu.:1.000000
Median :2.000000 Median :1.000000
Mean :1.608696 Mean :1.318245
3rd Qu.:2.000000 3rd Qu.:2.000000
Max. :3.000000 Max. :3.000000
NA's :420 NA's :42
The model seems to do what it's supposed to with no problems due to the missing values:
> # Logit model coefficients
>
> logit <- glm(Y ~ X, family=binomial (link = "logit"))
>
> summary(logit)
Call:
glm(formula = Y ~ X, family = binomial(link = "logit"))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0826825 -1.0911146 0.6473451 1.0190080 1.7457212
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.3457629 0.2884629 4.66529 3.0818e-06 ***
XRace -1.0716191 0.1339177 -8.00207 1.2235e-15 ***
XSex 0.5910812 0.1420270 4.16175 3.1581e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1434.5361 on 1057 degrees of freedom
Residual deviance: 1347.5684 on 1055 degrees of freedom
(420 observations deleted due to missingness)
AIC: 1353.5684
Number of Fisher Scoring iterations: 4
Question 1: When I don't have any NAs, this code seems to work well, but I get an error message whenever there are missing values. Is there a way to still see how many correctly predicted values I have, regardless of whether there is missing data?
> table(true = Y, pred = round(fitted(logit)))
Error in table(true = Y, pred = round(fitted(logit))) :
all arguments must have the same length
Edit: After adding na.action = na.exclude to the model definition (the updated call is sketched below the table), the table now works perfectly:
    pred
true   0   1
   0 259 178
   1 208 413
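The modified model call, for reference, was along these lines:

# Same model as before, but keep placeholders for the rows dropped because of
# NAs, so that fitted(logit) lines up with Y again:
logit <- glm(Y ~ X, family = binomial(link = "logit"), na.action = na.exclude)
table(true = Y, pred = round(fitted(logit)))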
Something that does still work, regardless of missing data, is appending the predictions to the original data frame with the code below. It correctly adds a 'pred' column at the end of the data frame with each row's probability (and simply inserts an NA instead if one of the predictors is missing):
> predictions = cbind(mydata, pred = predict(logit, newdata = mydata, type = "response"))
> write.csv(predictions, "~/Desktop/R/predictions.csv", row.names = F)
Question 2: However, when I try to predict into a new data frame -- even though it has the same variables of interest -- it seems like something about the missing values causes an error message as well. Is there code to get around this, or am I doing something incorrectly?
> newpredictions = cbind(newdata, pred = predict(logit, newdata = newdata, type = "response"))
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 1475, 1478
In addition: Warning message:
'newdata' had 1475 rows but variables found have 1478 rows
As you can see above, the number of rows in mydata is 1,478 and the number of rows in newdata is 1,475.
Thanks for the help!
If you have missing data (NAs), R will strip these when the modelling functions do formula -> model.frame -> model.matrix() etc., because the default in all these functions is na.action = na.omit. In other words, rows with NAs are deleted before the actual computations are performed. This deletion propagates through to the fitted values, residuals, etc. that are accessed from the model object.
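A quick way to see this, using the counts from your own output (1,478 rows, 420 of them incomplete):

# With the default na.action = na.omit, the 420 incomplete rows are dropped
# before fitting, so the fitted values are shorter than the response you
# supplied:
length(fitted(logit))   # 1058  (1478 - 420)
length(Y)               # 1478
# which is why table(true = Y, pred = round(fitted(logit))) complains that
# the arguments don't all have the same length.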
Realising this is an issue, R has other na.action functions, including na.exclude(). Hence, if you add na.action = na.exclude to your call to glm(), then fitted(), resid(), etc. would return as many values as you have rows in your input data.
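For example (a sketch, reusing your Y and X):

logit2 <- glm(Y ~ X, family = binomial, na.action = na.exclude)

# fitted() and resid() are now padded with NA in the positions of the
# dropped rows, so they line up with the original 1,478 observations:
length(fitted(logit2))  # 1478
length(resid(logit2))   # 1478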
You do seem to be going about modelling in a peculiar way. Why are you creating X and Y, presumably from your mydata object? It would be far better to do

mod <- glm(Support ~ Race + Sex, data = mydata, family = binomial,
           na.action = na.exclude)

where now, instead of the anonymous X and Y, we have variables that mean something, and you haven't had to create duplicate data.
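With that formulation, both of the things you are trying to do should work directly against the data frames. A sketch, assuming newdata really does contain Race and Sex columns as described:

# Observed vs predicted: na.exclude pads fitted(mod) back out to nrow(mydata),
# and table() silently drops the pairs that contain an NA:
table(true = mydata$Support, pred = round(fitted(mod)))

# Predicting into the new data frame: the formula terms Race and Sex are now
# looked up in 'newdata', so you get one prediction (or NA) per row of newdata:
newpredictions <- cbind(newdata,
                        pred = predict(mod, newdata = newdata, type = "response"))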