
Automatic refactoring based on levels beginning with a certain character?


I'm looking for a way to automatically recode the levels of a factor based on a pattern in the level names. I intend to apply the solution to several variables in a larger dataset.

The larger dataset has multiple variables like the example shown below. The levels tend to follow this pattern:

The main categories are 1, 2, 3 and 4. Levels 11, 12, 13, and 14 are subcategories of level 1. I wish to be able to streamline the grouping process. I've successfully performed the refactoring using fct_recode, but my intent is to extend this procedure to other variables that follow a similar coding pattern.

library(tidyverse)

dat <- tribble(
  ~Ethnicity, 
  "1",
  "2",
  "3",
  "4",
  "11",
  "12",
  "13",
  "14",
  "11",
  "13",
  "12",
  "12",
  "11",
  "13")

dat <- mutate_at(dat, vars(Ethnicity), factor)

count(dat, Ethnicity)
#> # A tibble: 8 x 2
#>   Ethnicity     n
#>   <fct>     <int>
#> 1 1             1
#> 2 11            3
#> 3 12            3
#> 4 13            3
#> 5 14            1
#> 6 2             1
#> 7 3             1
#> 8 4             1

dat %>% 
  mutate(Ethnicity = fct_recode(Ethnicity,
                                "1" = "1",
                                "1" = "11",
                                "1" = "12",
                                "1" = "13",
                                "1" = "14"
                                )) %>% 
  count(Ethnicity)
#> # A tibble: 4 x 2
#>   Ethnicity     n
#>   <fct>     <int>
#> 1 1            11
#> 2 2             1
#> 3 3             1
#> 4 4             1

Created on 2019-05-31 by the reprex package (v0.2.1)

This method successfully groups subcategories 11, 12, 13, and 14 into 1, as expected. Is there a way to do this without recoding each subcategory manually? And what would be the general method for extending this process to several variables that follow the same pattern? Thank you.


Solution

  • One option is to build a named vector (names are the new level, values are the old levels) and splice it into fct_recode with !!!

    library(dplyr)
    library(forcats)
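    # levels whose first character is "1": "1", "11", "12", "13", "14"
    lvls <- levels(dat$Ethnicity)[substr(levels(dat$Ethnicity), 1, 1) == "1"]
    # named vector: each old level (value) maps to the new level "1" (name)
    nm1 <- setNames(lvls, rep("1", length(lvls)))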
    dat %>% 
         mutate(Ethnicity = fct_recode(Ethnicity, !!!nm1)) %>% 
         count(Ethnicity)
    # A tibble: 4 x 2
    #  Ethnicity     n
    #  <fct>     <int>
    #1 1            11
    #2 2             1
    #3 3             1
    #4 4             1
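
The same idea can be stretched to cover every main category at once, not only 1. The following is a minimal sketch, assuming each main category is a single character so that the first character of a level identifies its group; `f`, `nm_all`, and `collapsed` are illustrative names that do not appear in the original answer.

```r
library(forcats)

# Toy factor mirroring the question's Ethnicity column
f <- factor(c("1", "2", "3", "4", "11", "12", "13", "14",
              "11", "13", "12", "12", "11", "13"))

# Names are the new levels (first character), values are the old levels
lvls_all <- levels(f)
nm_all <- setNames(lvls_all, substr(lvls_all, 1, 1))

# Splice the whole mapping into fct_recode
collapsed <- fct_recode(f, !!!nm_all)
table(collapsed)
```

If some main category had two or more digits (say, "10"), the first-character rule would no longer be enough and the mapping would need a different pattern.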
    

    Another option is to overwrite the levels directly, based on their first character. Note that this modifies dat in place

    levels(dat$Ethnicity)[substr(levels(dat$Ethnicity), 1, 1) == "1"] <- "1"
    dat %>% 
       count(Ethnicity)
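
The base R assignment can also handle all groups in one step, because assigning duplicated labels through `levels<-` merges the corresponding levels. A small sketch, again assuming single-character main categories; the factor `f` is a made-up example and not part of the original answer.

```r
# Toy factor with the same coding pattern as the question
f <- factor(c("1", "2", "3", "4", "11", "12", "13", "14"))

# Replace every level by its first character; duplicated
# labels are merged into a single level by `levels<-`
levels(f) <- substr(levels(f), 1, 1)

table(f)
```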
    

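    For multiple columns, use mutate_at and specify the variables of interest (colsOfInterest is a placeholder for the column names). Since nm1 was built from the levels of Ethnicity, the other columns need to share those level codes

    dat %>% 
        mutate_at(vars(colsOfInterest), list(~ fct_recode(., !!! nm1)))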
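
When the columns do not share exactly the same levels, a mapping built from one column may not fit the others. One possible alternative, sketched below, is to relabel each column from its own levels with `fct_relabel()` inside `across()` (this needs dplyr 1.0.0 or later). The data frame `dat2`, the `Region` column, and the helper `first_char` are hypothetical and only serve the illustration; the sketch again assumes single-character main categories.

```r
library(dplyr)
library(forcats)

# Hypothetical data: two columns that follow the same coding pattern
dat2 <- tibble(
  Ethnicity = factor(c("1", "11", "12", "2", "3", "4")),
  Region    = factor(c("2", "21", "22", "1", "3", "31"))
)

# Hypothetical helper: map a level to its main category (first character)
first_char <- function(lvl) substr(lvl, 1, 1)

# fct_relabel applies the helper to the levels of each column
# and collapses levels that end up with the same label
out <- dat2 %>%
  mutate(across(c(Ethnicity, Region), ~ fct_relabel(.x, first_char)))

count(out, Ethnicity)
count(out, Region)
```

Because the relabeling function is applied per column, no named vector has to be prepared in advance for each variable.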