Tags: r, tidyverse, data-munging

data cleaning - conversion to tidyverse


I am curious if the following code could be converted to tidyverse code. I've tried dplyr::mutate and haven't been able to get it to work quite right.

df$Gender[df$Gender == "M"] <- "Man"
df$Gender[df$Gender == "Male"] <- "Man"
df$Gender[df$Gender == "F"] <- "Woman"
df$Gender[df$Gender == "Female"] <- "Woman"
df$Gender[df$Gender == "M & F"] <- "Man and Woman"
df$Gender[df$Gender == "Male & Female"] <- "Man and Woman"

Solution

  • Here's one way, with dplyr::case_when():

    df$Gender <- dplyr::case_when(
      df$Gender %in% c("M", "Male") ~ "Man", 
      df$Gender %in% c("F", "Female") ~ "Woman",
      df$Gender %in% c("M & F", "Male & Female") ~ "Man and Woman",
      TRUE ~ NA_character_)
    

    Or, if you want to use the typical dplyr::/magrittr:: pipe-chain approach:

    library(dplyr)  # provides mutate(), case_when(), and the %>% pipe

    df <- df %>% mutate(Gender = case_when(
      Gender %in% c("M", "Male") ~ "Man",
      Gender %in% c("F", "Female") ~ "Woman",
      Gender %in% c("M & F", "Male & Female") ~ "Man and Woman",
      TRUE ~ NA_character_))
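
One behavioural difference is worth checking before swapping this in: the final `TRUE ~ NA_character_` branch turns every value not listed into NA, whereas the subsetted assignments in the question leave such values untouched. The sketch below compares the two fallbacks side by side. The sample data and the column names `Gender_na` and `Gender_keep` are made up for illustration.

```r
library(dplyr)

# Hypothetical data; "Other" and NA are not covered by any rule
df <- data.frame(Gender = c("M", "Female", "Other", NA),
                 stringsAsFactors = FALSE)

df <- df %>% mutate(
  # Fallback 1: unmatched values become NA
  Gender_na = case_when(
    Gender %in% c("M", "Male") ~ "Man",
    Gender %in% c("F", "Female") ~ "Woman",
    TRUE ~ NA_character_),
  # Fallback 2: unmatched values pass through unchanged,
  # which mirrors the base R code in the question
  Gender_keep = case_when(
    Gender %in% c("M", "Male") ~ "Man",
    Gender %in% c("F", "Female") ~ "Woman",
    TRUE ~ Gender))

print(df)
```

Which fallback is right depends on whether unexpected values should be flagged as missing or preserved for later inspection.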
    

    And finally, a tip: when there are a lot of unique values to group, using case_when() (or nested ifelse()s, or subsetted assignment, etc.) can get pretty tedious. One way to avoid much of the pain is to use a named vector as a dictionary-style "lookup table" that maps each raw value to its replacement (terminology informal -- see wiki on "associative array" for some background). In my experience this usually feels the cleanest:

    # the unique values 
    gender_values <- c("M","Man","Male","F","Woman","Female","MF","male-female")
    
    # associate unique values with our new labels: "m", "f", and "b"
    gender_lkup <- setNames(c("m","m","m","f","f","f","b","b"), gender_values)
    
    # suppose this is a column of a df 
    raw_column <- sample(gender_values, 10, replace=TRUE)
    
    # create a clean one by indexing `gender_lkup` by name
    # (unname() drops the names that the lookup carries over)
    clean_column <- unname(gender_lkup[raw_column])
    
    # inspect the two vectors side-by-side
    data.frame(original=raw_column, cleaned=clean_column)
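
To tie the lookup idea back to the question's own values, here is a minimal sketch, assuming a data frame `df` with a character `Gender` column. The sample rows and the `Gender_clean` column name are invented for illustration. Values missing from the lookup come back as NA, so `dplyr::coalesce()` is used here to fall back to the original value.

```r
library(dplyr)

# Hypothetical example data; "Nonbinary" is left out of the lookup on purpose
df <- data.frame(
  Gender = c("M", "Female", "M & F", "Male", "Nonbinary", NA),
  stringsAsFactors = FALSE
)

# Named vector: names are the raw values, elements are the cleaned labels
gender_lkup <- c(
  "M" = "Man", "Male" = "Man",
  "F" = "Woman", "Female" = "Woman",
  "M & F" = "Man and Woman", "Male & Female" = "Man and Woman"
)

df <- df %>%
  mutate(
    # Look up each raw value; anything not in the table becomes NA
    Gender_clean = unname(gender_lkup[Gender]),
    # Fall back to the original value where the lookup found nothing
    Gender_clean = coalesce(Gender_clean, Gender)
  )

print(df)
```

Dropping the `coalesce()` step leaves unmatched values as NA, which matches what the `case_when()` versions above do.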