I have a categorical variable that I want to convert into dummies for a classification task. The problem is that some of the levels appear only a few times, so they cause perfect multicollinearity when I split my sample into training and test sets.
How can I get rid of these levels in a quick and elegant way? Here is a simple example of my data:
label var_x
1 1
0 2
1 1
0 3
1 2
0 4
0 5
1 5
1 1
....
Let's say that I want to keep only the levels that appear more than once (or more than any other threshold). I want to recode all other cases as "0" and obtain something like this:
label var_x
1 1
0 2
1 1
0 0
1 2
0 0
0 5
1 5
1 1
....
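For anyone who wants to reproduce this, here is a minimal sketch of the example as a data frame. It assumes only the nine rows shown above (the elided rows are left out) and that both columns are integers:

```r
# Example data: only the nine rows shown above; the elided rows are omitted.
# Integer column types are an assumption.
df <- data.frame(
  label = c(1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L),
  var_x = c(1L, 2L, 1L, 3L, 2L, 4L, 5L, 5L, 1L)
)

# Frequency of each level: levels 3 and 4 appear only once
table(df$var_x)
```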
Thank you for your help
One dplyr option could be:
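library(dplyr)

df %>%
  add_count(var_x) %>%                           # n = how often each var_x level occurs
  mutate(var_x = as.numeric(n > 1) * var_x) %>%  # levels seen only once become 0
  select(-n)                                     # drop the helper count column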
label var_x
<int> <dbl>
1 1 1
2 0 2
3 1 1
4 0 0
5 1 2
6 0 0
7 0 5
8 1 5
9 1 1
And the same idea with base R:
as.numeric(with(data.frame(table(df$var_x)), Freq[match(df$var_x, Var1)]) > 1) * df$var_x
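
Since the question mentions "or any other number", the cutoff can also be made explicit. The following is only a sketch of another base R variant using `ave()`; the name `min_count` is made up for illustration, and the data frame is rebuilt from the nine rows shown in the question:

```r
# Example data from the question (only the rows shown there)
df <- data.frame(
  label = c(1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L),
  var_x = c(1L, 2L, 1L, 3L, 2L, 4L, 5L, 5L, 1L)
)

# Hypothetical threshold: keep levels that occur at least this many times
min_count <- 2

# ave() returns, for every row, the number of rows sharing its var_x value
counts <- ave(df$var_x, df$var_x, FUN = length)

# Recode the rare levels as 0
df$var_x[counts < min_count] <- 0L

df
```

Changing `min_count` to, say, 3 would also recode levels 2 and 5 in this example, leaving only level 1.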