r, logistic-regression, predict

Using "-" to exclude some variables when regressing


Using the Weekly dataset from the ISLR package in R:

> head(Weekly)
  Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today Direction
1 1990  0.816  1.572 -3.936 -0.229 -3.484 0.1549760 -0.270      Down
2 1990 -0.270  0.816  1.572 -3.936 -0.229 0.1485740 -2.576      Down
3 1990 -2.576 -0.270  0.816  1.572 -3.936 0.1598375  3.514        Up
4 1990  3.514 -2.576 -0.270  0.816  1.572 0.1616300  0.712        Up
5 1990  0.712  3.514 -2.576 -0.270  0.816 0.1537280  1.178        Up
6 1990  1.178  0.712  3.514 -2.576 -0.270 0.1544440 -1.372      Down

I am trying to use logistic regression to regress Direction on *all Lag variables and Volume*, and I used R's "all except" shortcut to exclude Year and Today:

logregall <- glm(Direction ~ . - Today - Year, 
                 family=binomial(link='logit'), data = Weekly)

But when I try to use this same object to make predictions, R complains that Year is missing from the 'newdata' data frame, even though Year was excluded from the model:

dataforpred <- Weekly[,2:7]
preds <- predict(object = logregall, newdata = dataforpred, type = "response")

> preds <- predict(object = logregall, newdata = dataforpred, type = "response")
Error in eval(predvars, data, env) : object 'Year' not found

But when I fit the model by typing out all the variables manually, I get a fitted object that works with predict():

logregall2 <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, 
                  family=binomial(link='logit'), data = Weekly)
preds <- predict(object = logregall2, newdata = dataforpred, type = "response")

> head(preds)
        1         2         3         4         5         6 
0.6086249 0.6010314 0.5875699 0.4816416 0.6169013 0.5684190 

Why is this the case?


Solution

  • I don't have the package, but I can reproduce the error with the mtcars dataset. I believe the reason is that subtracting a term with - only drops it from the model terms; the variable itself is still recorded in the fitted formula. predict() rebuilds the model frame from that formula, so it looks up every variable named there, including the subtracted ones, and it errors out when it cannot find them in newdata.

    Therefore, one solution is to add those columns to newdata with arbitrary values (NA here); the values are never used in the prediction.

    library(dplyr)  # provides %>% and mutate()

    fit <- glm(vs ~ . - mpg - cyl, data = mtcars,
               family = binomial(link = 'logit'))

    # newdata without mpg, cyl and the response vs
    dataforpred <- mtcars[, c(3:7, 9:11)]

    preds <- predict(object = fit, newdata = dataforpred, type = "response")
    # Error in eval(predvars, data, env) : object 'cyl' not found

    # solution: add the excluded columns back as NA placeholders
    dataforpred2 <- dataforpred %>%
      mutate(mpg = NA_real_,
             cyl = NA_real_)
    preds2 <- predict(object = fit, newdata = dataforpred2, type = "response")
    
    > preds2[1:5]
               1            2            3            4            5 
    2.220446e-16 1.081386e-11 1.000000e+00 1.000000e+00 2.220446e-16
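
To check the explanation above, it can help to inspect the terms stored in the fitted model. The following is only a sketch on mtcars, assuming base R and no extra packages. It lists the variables that predict() will look up and the terms that actually enter the model, then tries an alternative fix: fitting on a data frame that never contained the unwanted columns, so that . cannot expand to them. The names vars_needed, terms_used, train, fit_subset and padded are illustrative and do not come from the question.

```r
# Fit with the "all except" shortcut, as in the answer above.
# suppressWarnings(): this particular mtcars fit may emit convergence warnings.
fit <- suppressWarnings(
  glm(vs ~ . - mpg - cyl, data = mtcars, family = binomial(link = "logit"))
)

# Variables that predict() has to find in newdata (response removed).
tt <- delete.response(terms(fit))
vars_needed <- all.vars(attr(tt, "variables"))
print(vars_needed)

# Terms that actually enter the design matrix.
terms_used <- attr(terms(fit), "term.labels")
print(terms_used)

# Alternative: drop the unwanted columns before fitting, then use "." freely.
train <- mtcars[, setdiff(names(mtcars), c("mpg", "cyl"))]
fit_subset <- suppressWarnings(
  glm(vs ~ ., data = train, family = binomial(link = "logit"))
)

newdata <- mtcars[, c(3:7, 9:11)]  # no mpg, cyl or vs
preds_subset <- predict(fit_subset, newdata = newdata, type = "response")

# For comparison: the NA-placeholder approach with the original fit.
padded <- transform(newdata, mpg = NA_real_, cyl = NA_real_)
preds_padded <- predict(fit, newdata = padded, type = "response")

print(all.equal(preds_padded, preds_subset))
```

If vars_needed still lists mpg and cyl while terms_used does not, that matches the behaviour described above: the columns must exist in newdata, but their values play no part in the prediction. Subsetting the training data first avoids the placeholder columns entirely, at the cost of keeping a second data frame around.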