
Recode a numeric variable with many values including NAs


How can I recode a numeric variable with many distinct values, including missing values, to the integers 0 to n-1, where n is the number of unique values (counting NA as one of them), in a tidy way?

Example:

library(tidyverse)

df <- tibble(x = c(1000, 1000, NA, 1001, 1002, 1003, NA, 1003))

Required output (though it can be any recoding scheme as long as the values are 0:n-1):

# A tibble: 8 x 2
      x     y
  <dbl> <dbl>
1  1000     0
2  1000     0
3    NA     4
4  1001     1
5  1002     2
6  1003     3
7    NA     4
8  1003     3

I can do this in a tidy way by converting x to a factor, applying fct_recode() with a named character vector (built programmatically, since x has many values), and then converting back to numeric:

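# Build a factor with an explicit level for missing values
df <- df %>% mutate(x_fct = factor(case_when(
    is.na(x) ~ "level_na",
    TRUE ~ str_c("level_", x)
  )))
# Named vector mapping new level names ("0", "1", ...) to the old levels
x_levels <- levels(df$x_fct)
n_levels <- length(x_levels)
names(x_levels) <- as.character(0:(n_levels - 1))
# Rename the levels, then turn the factor codes (1..n) into 0..n-1
df <- df %>%
  mutate(y = as.numeric(fct_recode(x_fct, !!!x_levels)) - 1)
df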
# A tibble: 8 x 3
      x x_fct          y
  <dbl> <fct>      <dbl>
1  1000 level_1000     0
2  1000 level_1000     0
3    NA level_na       4
4  1001 level_1001     1
5  1002 level_1002     2
6  1003 level_1003     3
7    NA level_na       4
8  1003 level_1003     3

But this seems very cumbersome. Surely there's a simpler way, preferably in a single pipe.


Solution

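  • One way is to combine match() with unique(). match(x, unique(x)) returns the position of each value among the distinct values of x (NA is matched like any other value), and subtracting 1 gives codes from 0 to n-1. Wrapping unique(x) in sample() shuffles which value gets which code, so the output below changes from run to run.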

    library(dplyr)
    
    df %>%
      mutate(level = paste('level', x, sep = '_'), 
             y = match(x, sample(unique(x))) - 1)
    
    #      x level          y
    #  <dbl> <chr>      <dbl>
    #1  1000 level_1000     4
    #2  1000 level_1000     4
    #3    NA level_NA       2
    #4  1001 level_1001     0
    #5  1002 level_1002     1
    #6  1003 level_1003     3
    #7    NA level_NA       2
    #8  1003 level_1003     3
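
If a reproducible coding is preferred over a random one, the same match() idea works without sample(). The following is a minimal sketch under that assumption; the names res, y_first and y_sorted are illustrative and not part of the original answer. y_first numbers the values in order of first appearance, while y_sorted sorts them with NA last, which reproduces the output requested in the question.

```r
library(dplyr)

df <- tibble(x = c(1000, 1000, NA, 1001, 1002, 1003, NA, 1003))

res <- df %>%
  mutate(
    # Codes follow the order of first appearance; NA is matched like any value
    y_first  = match(x, unique(x)) - 1,
    # Codes follow sorted order, with NA placed last (highest code)
    y_sorted = match(x, sort(unique(x), na.last = TRUE)) - 1
  )

res$y_first
# 0 0 1 2 3 4 1 4
res$y_sorted
# 0 0 4 1 2 3 4 3
```

Both variants stay inside a single pipe and need no intermediate factor column.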