r machine-learning c5.0

Error: *** line 1 of `undefined.cases': bad value of ... for attribute


I'm training a C5.0 decision tree, and everything runs fine until I try to predict values for the test dataset. I am not sure what this error means:

library(pacman)
p_load(tidyverse, NHANES, C50)

rows <- sample(nrow(NHANES), as.integer(0.75 * nrow(NHANES)))

nhanes_train <- NHANES[rows,] %>%
  select(SleepTrouble, everything(), -ID)
nhanes_test <- NHANES[-rows,] %>%
  select(SleepTrouble, everything(), -ID)

nhanes_tree <- C5.0(nhanes_train[-1], nhanes_train$SleepTrouble)

nhanes_tree_pred <- predict(nhanes_tree, nhanes_test)

Output:

Error: *** line 1 of `undefined.cases': bad value of c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 1, 1, 1, 1, 1,' for attribute `SurveyYr' Error limit exceeded


Solution

  • It seems that when the predictors include non-numeric data such as factors, you have to use the formula interface of C5.0() rather than passing the predictors and the outcome separately. This works fine:

    nhanes_tree <- C5.0(SleepTrouble ~ ., nhanes_train)
    nhanes_tree_pred <- predict(nhanes_tree, nhanes_test)
    

    From the documentation:

    When using the formula method, factors and other classes are preserved (i.e. dummy variables are not automatically created). This particular model handles non-numeric data of some types (such as character, factor and ordered data).
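To check that the formula version really returns one prediction per test row, the whole workflow can be sketched end to end as below. This is only a sketch under a few assumptions of my own that are not part of the answer: the seed value is arbitrary, rows with a missing SleepTrouble are dropped before splitting, and the as.data.frame() conversion is an extra precaution in case the tibble input contributes to the problem (I have not confirmed that it does).

```r
library(C50)
library(NHANES)

set.seed(42)  # arbitrary seed, only so the train/test split is repeatable

# Assumption: work on a plain data.frame instead of the tibble shipped by NHANES
nhanes <- as.data.frame(NHANES)
nhanes$ID <- NULL                                 # drop the identifier column
nhanes <- nhanes[!is.na(nhanes$SleepTrouble), ]   # assumption: keep only rows with a known outcome

rows <- sample(nrow(nhanes), as.integer(0.75 * nrow(nhanes)))
nhanes_train <- nhanes[rows, ]
nhanes_test  <- nhanes[-rows, ]

# Formula interface, as in the answer above
nhanes_tree <- C5.0(SleepTrouble ~ ., data = nhanes_train)
nhanes_tree_pred <- predict(nhanes_tree, nhanes_test)

# Cross-tabulate predicted against observed classes on the held-out rows
table(predicted = nhanes_tree_pred, actual = nhanes_test$SleepTrouble)
```

If the call to predict() succeeds, the table gives a quick sanity check that the predictions line up with the test rows.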