I'm looking for a way to automatically recode the levels of a factor based on a pattern in the level names.
I have a larger dataset with multiple variables like the example below. The levels follow this pattern:
The main categories are 1, 2, 3 and 4. Levels 11, 12, 13, and 14 are subcategories of level 1. I'd like to streamline the grouping process. I've successfully recoded the levels using fct_recode, but I want to extend this procedure to other variables that follow a similar coding pattern.
library(tidyverse)
dat <- tribble(
~Ethnicity,
"1",
"2",
"3",
"4",
"11",
"12",
"13",
"14",
"11",
"13",
"12",
"12",
"11",
"13")
dat <- mutate_at(dat, vars(Ethnicity), factor)
count(dat, Ethnicity)
#> # A tibble: 8 x 2
#> Ethnicity n
#> <fct> <int>
#> 1 1 1
#> 2 11 3
#> 3 12 3
#> 4 13 3
#> 5 14 1
#> 6 2 1
#> 7 3 1
#> 8 4 1
dat %>%
mutate(Ethnicity = fct_recode(Ethnicity,
"1" = "1",
"1" = "11",
"1" = "12",
"1" = "13",
"1" = "14"
)) %>%
count(Ethnicity)
#> # A tibble: 4 x 2
#> Ethnicity n
#> <fct> <int>
#> 1 1 11
#> 2 2 1
#> 3 3 1
#> 4 4 1
Created on 2019-05-31 by the reprex package (v0.2.1)
This method successfully groups subcategories 11, 12, 13, and 14 into 1, as expected. Is there a way to do this without listing each subcategory manually? And what would be a general way to extend this to several variables that follow the same pattern? Thank you.
One option is to create a named vector and splice it into fct_recode with !!!:
library(dplyr)
library(forcats)
lvls <- levels(dat$Ethnicity)[substr(levels(dat$Ethnicity), 1, 1) == 1]
nm1 <- setNames(lvls, rep(1, length(lvls)))
dat %>%
mutate(Ethnicity = fct_recode(Ethnicity, !!!nm1)) %>%
count(Ethnicity)
# A tibble: 4 x 2
# Ethnicity n
# <fct> <int>
#1 1 11
#2 2 1
#3 3 1
#4 4 1
Another option is to set the levels directly, based on the first character of each level (substr):
levels(dat$Ethnicity)[substr(levels(dat$Ethnicity), 1, 1) == 1] <- 1
dat %>%
count(Ethnicity)
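The assignment above only merges levels that begin with 1. If every main category can be assumed to be the first character of its level (an assumption the question does not state, and one that would break for codes such as 10 or 100), the same base R idea can be sketched for all categories at once. Here x is a hypothetical example vector, not part of the original data.

```r
# Hypothetical factor with subcategories under several main categories
x <- factor(c("1", "11", "12", "2", "21", "3", "4", "41"))

# Assigning duplicated level names merges the corresponding levels,
# so each level is replaced by its first character
levels(x) <- substr(levels(x), 1, 1)

table(x)
```

This avoids building a separate named vector for each main category, at the cost of relying on the first-character assumption.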
For multiple columns, use mutate_at and specify the variables of interest:
dat %>%
# colsOfInterest is a placeholder for the actual column names
mutate_at(vars(colsOfInterest), list(~ fct_recode(., !!! nm1)))
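Because nm1 is built from the levels of Ethnicity, other columns may need their own mapping. As a rough sketch of one way around that, fct_relabel can derive the new level from each old level directly. The data frame dat2, the column Region, and the helper collapse_to_main are hypothetical names made up for illustration, and the sketch assumes that the main category is always the first character of the level and that dplyr 1.0.0 or later is available for across().

```r
library(dplyr)
library(forcats)

# Hypothetical data: two factor columns that share the coding pattern
dat2 <- tibble(
  Ethnicity = factor(c("1", "2", "11", "12", "13", "4")),
  Region    = factor(c("21", "2", "3", "31", "1", "14"))
)

# Hypothetical helper: relabel every level by its first character;
# fct_relabel merges levels that end up with the same label
collapse_to_main <- function(x) fct_relabel(x, ~ substr(.x, 1, 1))

out <- dat2 %>%
  mutate(across(c(Ethnicity, Region), collapse_to_main))

count(out, Ethnicity)
count(out, Region)
```

With an older dplyr, mutate_at(vars(Ethnicity, Region), collapse_to_main) should behave the same way.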