
GLM: Warning message: 'newdata' had 16623 rows but variables found have 22488 rows


I have scoured the forum far and wide and found many posts like this; however, none of them solved my issue.

Now, I turn to you.

I have data similar to this:

ontime currency incoterms price month
1      USD      FOB       234.2    01
1      CAD      FOB        92.4    01
0      USD      DAP       238.9    02
0      EUR      FOB       100      03
1      CNY      DAP       739.8    04

I run this code:

g = df$ontime      #binary
a = df$currency    #String
b = df$INCOTERMS   #String
c = df$price       #float
f = df$month       #string

mod1 <- glm(g~a+b+c,family=binomial(link="logit"), data=df[f=="01",])
pred_ontime1 <- predict(mod1,df[f%in%c("02","03","04"),],type="response")

I want to test the model, which I trained on data from month 01, on months 02, 03 and 04.

My outcome, however, is this:

Warning message:
'newdata' had 16623 rows but variables found have 22488 rows

I have tried training on month 01 and testing on 01, 02, 03 and 04, which did not produce the warning, but it seems inappropriate to test on data included in my training set.

The value 16623 is of course the combined number of rows in 02, 03 and 04, while 22488 is the combined number of rows in 01, 02, 03 and 04.

What can I do?


Solution

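  • Try fitting the model without first saving each column to a vector. The formula g ~ a + b + c refers to the full-length vectors in your workspace, not to columns of df, so subsetting df has no effect on them. predict() cannot find a, b and c in newdata, falls back to the 22488-row vectors, and warns about the mismatch. For the same reason, glm() was fitted on all months rather than only month 01. Use the column names in the formula instead: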

    mod1 <- glm(ontime ~ currency + INCOTERMS + price, family = binomial(link = "logit"), data = df[df$month == "01",])
    pred_ontime1 <- predict(mod1,df[df$month %in% c("02","03","04"),], type = "response")
    

    See if that works.


    Here is a reproducible example for anyone interested:

    df <- read.table(textConnection("ontime currency incoterms price month
    0      USD      DAP       234.2    01
    1      CAD      FOB        92.4    01
    0      USD      DAP       238.9    02
    0      USD      FOB       100      03
    1      CAD      DAP       739.8    04"), header = TRUE)
    
    # read.table() reads month as an integer here, so "01" becomes 1
    mod1 <- glm(ontime ~ currency + incoterms + price, family = binomial(link = "logit"), data = df[df$month == 1, ])
    pred_ontime1 <- predict(mod1, df[df$month %in% 2:4, ], type = "response")
    pred_ontime1
               3            4            5 
    5.826215e-11 5.826215e-11 1.000000e+00
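
To see why the original pattern produces the warning, here is a minimal sketch on synthetic data (40 made-up rows, 10 per month; the names `c_price`, `train`, `test`, `mod_bad` and `mod_ok` are only for illustration and are not from the question). It suggests that a formula built from workspace vectors ignores the `data` subset in both `glm()` and `predict()`, while a formula built from column names respects it.

```r
# Synthetic data for illustration only; not the asker's data.
set.seed(1)
df <- data.frame(
  ontime = rbinom(40, 1, 0.5),
  price  = runif(40, 50, 500),
  month  = rep(c("01", "02", "03", "04"), each = 10),
  stringsAsFactors = FALSE
)

# Pattern from the question: full-length copies of the columns.
g       <- df$ontime
c_price <- df$price
f       <- df$month

train <- df[f == "01", ]
test  <- df[f %in% c("02", "03", "04"), ]

# g and c_price are not columns of `train`, so glm() takes them from the
# workspace and fits on all 40 rows instead of the 10 rows of month 01.
mod_bad <- glm(g ~ c_price, family = binomial(link = "logit"), data = train)
nobs(mod_bad)     # 40

# predict() does not find c_price in `test` either, falls back to the
# 40-row workspace vector and emits the "'newdata' had ... rows" warning
# (suppressed here so the script runs quietly).
pred_bad <- suppressWarnings(predict(mod_bad, test, type = "response"))
length(pred_bad)  # 40

# Column names in the formula are looked up in `data` and `newdata`.
mod_ok  <- glm(ontime ~ price, family = binomial(link = "logit"), data = train)
pred_ok <- predict(mod_ok, test, type = "response")
nobs(mod_ok)      # 10
length(pred_ok)   # 30
```

Checking `nobs()` on the fitted model is a quick way to confirm that it was trained on the rows you intended.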