Search code examples
rdplyrforcats

Trim factor labels for multiple variables at the same time


This is my toy dataset:

 library(tidyverse)
dat <- tibble (x1 = c("False - very long label specific to x1", "False - very long label specific to x1", "True - very long label specific to x1", "True - very long label specific to x1"),
               x2 = c("False - very long label specific to x2", "False - very long label specific to x2", "False - very long label specific to x2", "True - very long label specific to x2"),
               y = c(10, 5, 12, 4)) %>% mutate_at(vars(x1:x2), factor)
head(dat)
#> # A tibble: 4 x 3
#>   x1                                x2                                    y
#>   <fct>                             <fct>                             <dbl>
#> 1 False - very long label specific~ False - very long label specific~    10
#> 2 False - very long label specific~ False - very long label specific~     5
#> 3 True - very long label specific ~ False - very long label specific~    12
#> 4 True - very long label specific ~ True - very long label specific ~     4

I would like to trim the very long factor labels, they all have two things in common:

  1. all start with True or False
  2. include the column name (ie the factor labels for each column are unique)

I would like to simplify this, and have only something like True and False for each factor column. This is my desired output:


#> # A tibble: 4 x 3
#>   x1    x2        y
#>   <fct> <fct> <dbl>
#> 1 False False    10
#> 2 False False     5
#> 3 True  False    12
#> 4 True  True      4

I think it should work with something like mutate_at and fct_relabel and maybe str_trunc, but I could not figure it out.


Solution

  • We can use trimws with whitespace

    library(dplyr)
    dat %>% 
        mutate_if(is.factor, ~ factor(trimws(., whitespace = "\\s*-.*")))
    # A tibble: 4 x 3
    #  x1    x2        y
    #  <fct> <fct> <dbl>
    #1 False False    10
    #2 False False     5
    #3 True  False    12
    #4 True  True      4
    

    Or with fct_relabel and str_remove

    library(forcats)
    library(stringr)
    dat %>% 
        mutate_if(is.factor, ~ fct_relabel(., ~str_remove(., '\\s*-.*')))
    

    Or using data.table

    library(data.table)
    m1 <- names(which(sapply(dat, is.factor)))
    setDT(dat)[, (nm1) := lapply(.SD, function(x) 
            factor(sub('\\s*-.*', "", x))) , .SDcols = nm1]