
How to collapse categories or recategorize variables?


In R, I have 600,000 categorical variables, each of which is classified as "0", "1", or "2".

What I would like to do is collapse "1" and "2" and leave "0" by itself, so that after recategorizing, "0" stays "0" and both "1" and "2" become "1". In the end I only want "0" and "1" as categories for each of the variables.

Also, if possible, I would rather not create 600,000 new variables. If I can replace the existing variables with the new values, that would be great!

What would be the best way to do this?


Solution

  • There is a function recode in package car (Companion to Applied Regression):

    library(car)
    recode(x, "c('1','2')='1'; else='0'")
    

    Or, for your case, in plain R (as.numeric(x) returns the factor's integer level codes 1, 2, 3, which pmin then caps at 2):

    > x <- factor(sample(c("0","1","2"), 10, replace=TRUE))
    > x
     [1] 1 1 1 0 1 0 2 0 1 0
    Levels: 0 1 2
    > factor(pmin(as.numeric(x), 2), labels=c("0","1"))
     [1] 1 1 1 0 1 0 1 0 1 0
    Levels: 0 1
    

    Update: To recode all categorical columns of a data frame tmp, you can use the following:

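    # Cap the level codes (1, 2, 3) at 2, then relabel them as "0" and "1"
    recode_fun <- function(x) factor(pmin(as.numeric(x), 2), labels=c("0","1"))
    library(plyr)
    # catcolwise() applies recode_fun to the categorical columns of tmp
    catcolwise(recode_fun)(tmp)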
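
If you would rather avoid extra packages and overwrite the existing columns in place, one possible base R approach is to merge the factor levels directly. This is only a sketch: the data frame tmp below is made-up example data, and collapse_12 is a hypothetical helper name, not something from car or plyr. It assumes the columns to recode are stored as factors.

```r
# Made-up example data: an id column plus three factor columns coded "0", "1", "2"
set.seed(1)
tmp <- data.frame(
  id = 1:10,
  v1 = factor(sample(c("0", "1", "2"), 10, replace = TRUE)),
  v2 = factor(sample(c("0", "1", "2"), 10, replace = TRUE)),
  v3 = factor(sample(c("0", "1", "2"), 10, replace = TRUE))
)

# Hypothetical helper: merge levels "1" and "2" into "1", leave "0" alone
collapse_12 <- function(x) {
  x <- as.factor(x)
  # Assigning the same name to several levels merges them into one level
  levels(x)[levels(x) %in% c("1", "2")] <- "1"
  # Make sure both levels exist even if a column happens to contain no "0" or no "1"
  factor(x, levels = c("0", "1"))
}

# Overwrite only the factor columns; no new variables are created
is_cat <- vapply(tmp, is.factor, logical(1))
tmp[is_cat] <- lapply(tmp[is_cat], collapse_12)
```

Unlike the pmin version, this does not rely on the level codes being exactly 1, 2, 3, so it also behaves sensibly for a column in which one of the three categories never occurs. Non-factor columns such as id are left untouched.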