
Missing values when creating training and testing data with caret


My question is about how to handle missing values when fitting models with caret's train function. A small sample of my data looks like this:

       # Output of dput(dat); assigning it recreates the sample data frame:
       dat <- structure(list(LagO3 = c(NA, NA, NA, 40, 45, NA), RH = c(69.4087524414062, 
       79.9608383178711, 64.4592437744141, 66.4207077026367, 66.0899200439453, 
       91.3353729248047), SR = c(298.928888888889, 300.128888888889, 
       303.688888888889, 304.521111111111, 303.223333333333, 294.716666666667
       ), ST = c(317.9917578125, 317.448253038194, 311.039059244792, 
       312.557927517361, 321.252841796875, 330.512212456597), Tmx = c(294.770359293045, 
       294.897191864461, 295.674552786042, 296.247345044048, 296.108238352818, 
       294.594430242372), CWTE = c(0, 1, 0, 0, 0, 0), CWTW = c(0, 0, 
       0, 0, 0, 0), o3 = c(NA, NA, NA, 52, 55, NA)), .Names = c("LagO3", 
       "RH", "SR", "ST", "Tmx", "CWTE", "CWTW", "o3"), row.names = c("1", 
       "2", "3", "4", "5", "6"), class = "data.frame")

The problem is that one of my predictors has NA in several positions, and the predictand (o3) also has NA, but in different positions. So I tried:

model <- train(x = na.omit(x.training), y = na.omit(training$o3), method = "lmStepAIC",
               direction="backward", trControl = control)

But then x and y end up with different lengths. I also tried:

model <- train(x = x.training, y = training$o3, na.action = na.pass,
               method = "lmStepAIC", direction = "backward", trControl = control)

which gives the following error:

Error in quantile.default(y, probs = seq(0, 1, length = cuts)) : missing values and NaN's not allowed if 'na.rm' is FALSE

I would appreciate any suggestion!

Thanks a lot.


Solution

  • Pass na.omit to train through its na.action argument. The documentation (see ?train) describes na.action as follows:

    A function to specify the action to be taken if NAs are found. The default action is for the procedure to fail. An alternative is na.omit, which leads to rejection of cases with missing values on any required variable. (NOTE: If given, this argument must be named.)
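
    To make "rejection of cases with missing values on any required variable" concrete, here is a minimal base-R sketch of what na.omit does to a data frame. The data frame toy and its values are made up for illustration and are not the asker's data:

    ```r
    # Toy data (hypothetical): NAs sit in different rows of predictor and outcome.
    toy <- data.frame(
      LagO3 = c(NA, 40, 45, 50),
      RH    = c(69.4, 79.9, 64.5, 66.4),
      o3    = c(48, NA, 55, 52)
    )

    # na.omit() on a data frame drops every row that has an NA in any column,
    # so predictors and outcome are filtered together and stay aligned.
    kept <- na.omit(toy)
    print(kept)

    # The indices of the dropped rows are recorded in the "na.action" attribute.
    dropped <- as.integer(attr(kept, "na.action"))
    print(dropped)
    ```

    The point is that a row is removed as a whole, which is different from calling na.omit on x and y separately.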

    So the following will work:

    model <- train(x = x.training, y = training$o3,
                   method = "lmStepAIC", direction = "backward",
                   trControl = control, na.action = na.omit)
    

    Output:

    > model <- train(x = x.training, y = y.training, method = "lmStepAIC",direction="backward",
    +                na.action=na.omit)
    Start:  AIC=-129.7
    .outcome ~ LagO3 + RH + SR + ST + Tmx + CWTE + CWTW
    
    
    Step:  AIC=-129.7
    .outcome ~ LagO3 + RH + SR + ST + Tmx + CWTE
    
    
    Step:  AIC=-129.7
    .outcome ~ LagO3 + RH + SR + ST + Tmx
    
    
    Step:  AIC=-129.7
    .outcome ~ LagO3 + RH + SR + ST
    
    
    Step:  AIC=-129.7
    .outcome ~ LagO3 + RH + SR
    
    
    Step:  AIC=-129.7
    .outcome ~ LagO3 + RH
    
    
    Step:  AIC=-129.7
    .outcome ~ LagO3
    
    
    Step:  AIC=-129.7
    .outcome ~ 1
    ...
    ...
    ...
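
An alternative that does not rely on na.action at all is to drop incomplete rows before calling train, using one shared logical index for x and y. This avoids the length mismatch from the first attempt in the question, and it removes the NAs from y up front, which is what the quantile error message complains about. A minimal base-R sketch on toy data follows; the values and the names keep, x.clean and y.clean are illustrative assumptions, not taken from the question:

```r
# Toy stand-ins for x.training and training$o3 (hypothetical values).
x.training <- data.frame(
  LagO3 = c(NA, 40, 45, 50, 47),
  RH    = c(69.4, 79.9, 64.5, 66.4, 66.1)
)
y <- c(48, NA, NA, 52, 51)

# Separate na.omit() calls drop different observations, so the sizes disagree.
n.x <- nrow(na.omit(x.training))
n.y <- length(na.omit(y))

# One shared index: TRUE only where all predictors and the outcome are present.
keep <- complete.cases(x.training, y)
x.clean <- x.training[keep, , drop = FALSE]
y.clean <- y[keep]

# With caret loaded and `control` defined as in the question, the cleaned
# pair could then be passed on, for example:
# model <- train(x = x.clean, y = y.clean, method = "lmStepAIC",
#                direction = "backward", trControl = control)
```

Filtering once with complete.cases keeps each remaining row of x.clean paired with the matching element of y.clean.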