r dataframe dummy-variable

Convert low-frequency levels of a categorical variable to "others" in R


I have a categorical variable that I want to convert into dummy variables for a classification task. The problem is that some of the levels appear only a few times, which causes perfect multicollinearity when I split my sample into a training set and a test set.

How can I get rid of these levels in a quick and elegant way? Here is a simple example of my data:

label   var_x
 1        1
 0        2
 1        1
 0        3
 1        2
 0        4
 0        5
 1        5
 1        1
 ....

Let's say that I want to keep only the levels that appear more than once (or more than any other threshold) and recode all remaining cases as "0", to obtain something like this:

label   var_x
 1        1
 0        2
 1        1
 0        0
 1        2
 0        0
 0        5
 1        5
 1        1
 ....

Thank you for your help


Solution

  • One dplyr option could be:

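    library(dplyr)

    df %>%
      add_count(var_x) %>%                           # n = number of rows sharing this var_x
      mutate(var_x = as.numeric(n > 1) * var_x) %>%  # values seen only once become 0
      select(-n)                                     # drop the helper count column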
    
      label var_x
      <int> <dbl>
    1     1     1
    2     0     2
    3     1     1
    4     0     0
    5     1     2
    6     0     0
    7     0     5
    8     1     5
    9     1     1
    

    And the same idea with base R:

    as.numeric(with(data.frame(table(df$var_x)), Freq[match(df$var_x, Var1)]) > 1) * df$var_x
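
Since the question asks for "more than 1 (or any other number)", the same idea can be wrapped in a small base R helper with an adjustable threshold. This is only a sketch: the function name `lump_rare` and its arguments `min_count` and `other` are made up for illustration, and it assumes the column contains no `NA` values. The example data frame is rebuilt from the values shown in the question.

```r
# Hypothetical helper (not part of the original answer): recode every value
# of x that occurs fewer than min_count times as `other`.
# Assumption: x has no NA values.
lump_rare <- function(x, min_count = 2, other = 0) {
  # ave() applies length() within each group of identical values,
  # so counts[i] is the number of times x[i] occurs in x.
  counts <- ave(seq_along(x), x, FUN = length)
  x[counts < min_count] <- other
  x
}

# Example data copied from the question.
df <- data.frame(
  label = c(1, 0, 1, 0, 1, 0, 0, 1, 1),
  var_x = c(1, 2, 1, 3, 2, 4, 5, 5, 1)
)

# Keep values that appear at least twice; everything else becomes 0.
df$var_x <- lump_rare(df$var_x, min_count = 2)
df$var_x
```

With the sample data this should turn the values 3 and 4 (each seen once) into 0 and leave the rest unchanged. For a character column, something like `other = "others"` may read better than 0; a factor column would probably need `as.character()` first, because assigning a value that is not an existing level produces `NA`.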