How to recode a numeric variable with many values including missing values, to have numbers 0:n-1
where n
is the number of unique values including NA
, in a tidy way?
Example:
df <- tibble(x = c(1000, 1000, NA, 1001, 1002, 1003, NA, 1003))
Required output (though it can be any recoding scheme as long as the values are 0:n-1
):
# A tibble: 8 x 2
x y
<dbl> <dbl>
1 1000 0
2 1000 0
3 NA 4
4 1001 1
5 1002 2
6 1003 3
7 NA 4
8 1003 3
I am able to do this in a tidy way by first converting x
to a factor then using fct_recode()
with a named list (created automatically because remember there are many values in x
), then back to numeric:
df <- df %>% mutate(x_fct = factor(case_when(
is.na(x) ~ "level_na",
TRUE ~ str_c("level_", x)
)))
x_levels <- levels(df$x_fct)
n_levels <- length(x_levels)
names(x_levels) <- as.character(0:(n_levels - 1))
df <- df %>%
mutate(y = as.numeric(fct_recode(x_fct, !!!x_levels)) - 1)
df
# A tibble: 8 x 3
x x_fct y
<dbl> <fct> <dbl>
1 1000 level_1000 0
2 1000 level_1000 0
3 NA level_na 4
4 1001 level_1001 1
5 1002 level_1002 2
6 1003 level_1003 3
7 NA level_na 4
8 1003 level_1003 3
But this seems very cumbersome. Surely there's a simpler way, preferably in a single pipe.
One way would be to use match
+ unique
. You can add sample
to add randomness
library(dplyr)
df %>%
mutate(level = paste('level', x, sep = '_'),
y = match(x, sample(unique(x))) - 1)
# x level y
# <dbl> <chr> <dbl>
#1 1000 level_1000 4
#2 1000 level_1000 4
#3 NA level_NA 2
#4 1001 level_1001 0
#5 1002 level_1002 1
#6 1003 level_1003 3
#7 NA level_NA 2
#8 1003 level_1003 3