r, dplyr, forcats

Collapse, order and drop factors efficiently in dplyr


After subsetting a large data frame, we are left with a factor variable whose levels need reordering and whose unused levels need dropping. A reprex is below:

library(tidyverse)

set.seed(1234)

data <- c("10th Std. Pass", "11th Std. Pass", "12th Std. Pass", "5th Std. Pass", 
          "6th Std. Pass", "Diploma / certificate course", "Graduate", "No Education")

education <-  factor(sample(data, size = 5, replace = TRUE), 
                     levels = c(data, "Data not available"))

survey <-  tibble(education)

The code further below, as per this answer, achieves what we want, but we'd like to integrate the reordering and dropping of factor levels into our piped recoding of the survey.

recoded_s <- survey %>%
  mutate(
    education = fct_collapse(
      education,
      "None" = "No Education",
      "Primary" = c("5th Std. Pass", "6th Std. Pass"),
      "Secondary" = c("10th Std. Pass", "11th Std. Pass", "12th Std. Pass"),
      "Tertiary" = c("Diploma / certificate course", "Graduate")
    )
  )

recoded_s$education
#> [1] Secondary Primary   Primary   Primary   Tertiary 
#> Levels: Secondary Primary Tertiary None Data not available


# Re-ordering the levels and dropping the unused ones
factor(recoded_s$education, levels = c("None", "Primary", "Secondary", "Tertiary"))
#> [1] Secondary Primary   Primary   Primary   Tertiary 
#> Levels: None Primary Secondary Tertiary

Any pointers would be much appreciated!


Solution

  • I'm not sure I understand. Could you elaborate on why wrapping everything inside a single mutate call doesn't suffice?

    library(tidyverse)
    library(forcats)
    survey %>%
        mutate(
            education = fct_collapse(
                education,
                "None" = "No Education",
                "Primary" = c("5th Std. Pass", "6th Std. Pass"),
                "Secondary" = c("10th Std. Pass", "11th Std. Pass", "12th Std. Pass"),
                "Tertiary" = c("Diploma / certificate course", "Graduate")),
            education = factor(education, levels = c("None", "Primary", "Secondary", "Tertiary")))
    

    Alternative using dplyr::recode

    lvls <- list(
        "No Education" = "None",
        "5th Std. Pass" = "Primary",
        "6th Std. Pass" = "Primary",
        "10th Std. Pass" = "Secondary",
        "11th Std. Pass" = "Secondary",
        "12th Std. Pass" = "Secondary",
        "Diploma / certificate course" = "Tertiary",
        "Graduate" = "Tertiary")
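    # Splice the lookup list into recode(), then set the level order explicitly;
    # levels not listed (e.g. "Data not available") are dropped
    survey %>%
        mutate(
            education = factor(recode(education, !!!lvls), levels = unique(map_chr(lvls, 1))))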
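
If you'd rather stay within forcats, one possible sketch is to chain fct_collapse(), fct_relevel() and fct_drop() inside the same mutate call. The fixed input values below are made up for illustration (they replace the sample() call so the result is reproducible). Note one difference from factor(levels = ...): fct_drop() only removes levels that have no entries, so if "Data not available" actually occurred in the data it would be kept rather than turned into NA.

```r
library(dplyr)
library(forcats)

raw_levels <- c("10th Std. Pass", "11th Std. Pass", "12th Std. Pass", "5th Std. Pass",
                "6th Std. Pass", "Diploma / certificate course", "Graduate", "No Education")

# Illustrative fixed values (an assumption, used instead of sample())
survey <- tibble(
  education = factor(
    c("12th Std. Pass", "5th Std. Pass", "6th Std. Pass", "Graduate"),
    levels = c(raw_levels, "Data not available")
  )
)

recoded_s <- survey %>%
  mutate(
    education = education %>%
      # Collapse the detailed levels into four groups
      fct_collapse(
        "None" = "No Education",
        "Primary" = c("5th Std. Pass", "6th Std. Pass"),
        "Secondary" = c("10th Std. Pass", "11th Std. Pass", "12th Std. Pass"),
        "Tertiary" = c("Diploma / certificate course", "Graduate")
      ) %>%
      # Move the four groups to the front in the desired order
      fct_relevel("None", "Primary", "Secondary", "Tertiary") %>%
      # Drop only the unused "Data not available" level;
      # "None" is kept even though it has no entries here
      fct_drop(only = "Data not available")
  )

levels(recoded_s$education)
```

Passing only = to fct_drop() is what keeps the empty "None" level; a bare fct_drop() would remove every unused level.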