r, dplyr, forcats

Recode levels of multiple factors to specified range


I have the following data frame:

library(tidyverse)
df <- tibble(a = c(1, 2, 3, 4, 5),
             b = c("Y", "N", "N", "Y", "N"),
             c = c("A", "B", "C", "A", "B"))

# funs() is deprecated and mutate_if() is superseded; across() is the current idiom
df <- df %>%
  mutate(across(where(is.character), as.factor))

The output of df:

      a b     c    
  <dbl> <fct> <fct>
1     1 Y     A    
2     2 N     B    
3     3 N     C    
4     4 Y     A    
5     5 N     B    

I would like to recode the levels of all factor columns (b and c) to integers: a factor with exactly two levels should be recoded to {0, 1}, and any other factor to {1, 2, 3, ...}. The output should be:

      a b     c    
  <dbl> <fct> <fct>
1     1 1     1    
2     2 0     2    
3     3 0     3    
4     4 1     1    
5     5 0     2    

I can recode the variables one by one, but I wonder if there is a more convenient approach that handles all factor columns at once.


Solution

  • One dplyr option could be:

    df %>%
      mutate(across(where(is.factor),
                    # two distinct values: label them 0 and 1; otherwise 1, 2, 3, ...
                    ~ if (n_distinct(.) == 2) {
                        factor(., labels = 0:1)
                      } else {
                        factor(., labels = 1:n_distinct(.))
                      }))
    
          a b     c    
      <dbl> <fct> <fct>
    1     1 1     1    
    2     2 0     2    
    3     3 0     3    
    4     4 1     1    
    5     5 0     2
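
  • The same idea could also be pulled into a small named function, which may be easier to reuse and test. The sketch below is only one possible variant, not part of the original answer: `recode_levels` is a hypothetical helper name, and it assumes that levels which never occur in the data should be dropped before counting.

```r
library(dplyr)

# Hypothetical helper (an illustration, not from the original answer):
# relabel the levels of one factor as 0/1 or as 1, 2, 3, ...
recode_levels <- function(x) {
  x <- droplevels(x)                 # assumption: ignore unused levels
  k <- nlevels(x)
  new_labels <- if (k == 2) 0:1 else seq_len(k)
  # keep the existing level order and only swap the labels
  factor(x, levels = levels(x), labels = new_labels)
}

df <- tibble(a = c(1, 2, 3, 4, 5),
             b = factor(c("Y", "N", "N", "Y", "N")),
             c = factor(c("A", "B", "C", "A", "B")))

out <- df %>%
  mutate(across(where(is.factor), recode_levels))
```

    The columns are still factors here, with the new labels assigned in level order (alphabetical by default, so N becomes 0 and Y becomes 1). If plain integers are wanted instead, `as.integer(as.character(x))` converts such a column.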