
Error in converting categorical variables to factor in R


While following this tutorial, I tried a different way of converting categorical variables to factors.

The article uses the following method:

library(MASS)
library(rpart)
cols <- c('low', 'race', 'smoke', 'ht', 'ui')
birthwt[cols] <- lapply(birthwt[cols], as.factor)

I replaced the last line with

birthwt[cols] <- as.factor((birthwt[cols]))

but the result is all NA.


What is wrong with that?


Solution

  • as.factor((birthwt[cols])) calls as.factor on a list of 5 vectors (a data frame is a list of columns). R then treats each of those 5 vectors as a single element: you get one factor of length 5 whose levels are built from whole columns and whose labels are the column headers, which is clearly not what you want:

    > as.factor(birthwt[cols])
      low  race smoke    ht    ui 
     <NA>  <NA>  <NA>  <NA>  <NA> 
    5 Levels: c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) ...
    > labels(as.factor(birthwt[cols]))
    [1] "low"   "race"  "smoke" "ht"    "ui" 
    

    lapply, by contrast, iterates over the list and calls as.factor on each vector separately. That is what you need here: each variable is converted to a factor on its own, rather than the entire list being converted into a single factor, which is what as.factor(birthwt[cols]) does.
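
The column-by-column conversion can be checked on a small example. This is only a sketch: the data frame df below is a made-up stand-in for birthwt (its values and the choice of columns are assumptions for illustration), so that it runs without loading MASS.

```r
# Toy stand-in for birthwt; the values are made up for illustration.
df <- data.frame(
  low   = c(0, 0, 1, 1),
  race  = c(1, 2, 3, 1),
  smoke = c(0, 1, 1, 0),
  bwt   = c(2523, 2551, 2557, 2594)
)
cols <- c("low", "race", "smoke")

# lapply() visits each selected column in turn and returns a list of
# factors, which is then assigned back column by column.
df[cols] <- lapply(df[cols], as.factor)

# Check which columns are factors now; bwt was not in cols, so it
# should remain numeric.
print(sapply(df, is.factor))
print(levels(df$race))
```

Because lapply returns a list with one element per column, the assignment df[cols] <- ... lines up each converted factor with its original column.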