Tags: r, dplyr, tidyverse, vectorization, dummy-variable

How do I efficiently recode groups of dummies conditional on one dummy?


I'm trying to recode several dummy variables at once, but I can't come up with a working vectorized solution (a for loop would also be fine).

reprex:

library(tidyverse)
library(magrittr)
library(dummies)
library(janitor)

df_raw <- data.frame(
  species = as.factor(c("cat", "dog", NA, "dog", "dog")),
  weight = rnorm(5, mean = 5, sd = 1),
  sex = as.factor(c("m", NA, "f", "f", "m"))
)

df_raw

  species   weight  sex
1     cat 3.025896    m
2     dog 3.223064 <NA>
3    <NA> 5.230367    f
4     dog 4.231511    f
5     dog 5.819032    m


I split the factor variables (species and sex) into dummies, but the NAs get their own indicators (species_na and sex_na):

df_dummy <- dummies::dummy.data.frame(df_raw,
                                      dummy.classes = "factor",
                                      sep = "_",
                                      omit.constants = TRUE,
                                      all = TRUE) %>% 
  janitor::clean_names()

  species_cat species_dog species_na   weight sex_f sex_m sex_na
1           1           0          0 3.025896     0     1      0
2           0           1          0 3.223064     0     0      1
3           0           0          1 5.230367     1     0      0
4           0           1          0 4.231511     1     0      0
5           0           1          0 5.819032     0     1      0

My problem: how do I efficiently recode all of the factor dummies in a group (identified by their prefix, e.g. species_) to NA, conditional on the value of that group's _na dummy? In other words, every dummy with the prefix species_ should become NA whenever species_na == 1, and likewise for the sex_ dummies whenever sex_na == 1.
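To make the target concrete, here is a minimal base R sketch of the rule for a single group. The toy df_dummy is typed in by hand from the output above (weights rounded), and df_species and in_group are illustrative names that do not appear in the original code. It only shows the intended result for one prefix, not a general solution.

```r
# Hand-typed copy of the dummy-coded data shown above (weights rounded)
df_dummy <- data.frame(
  species_cat = c(1, 0, 0, 0, 0),
  species_dog = c(0, 1, 0, 1, 1),
  species_na  = c(0, 0, 1, 0, 0),
  weight      = c(3.03, 3.22, 5.23, 4.23, 5.82),
  sex_f       = c(0, 0, 1, 1, 0),
  sex_m       = c(1, 0, 0, 0, 1),
  sex_na      = c(0, 1, 0, 0, 0)
)

# Columns that belong to the species_ group (including species_na itself)
in_group <- startsWith(names(df_dummy), "species_")

# Set the whole group to NA in every row where species_na == 1
df_species <- df_dummy
df_species[df_dummy$species_na == 1, in_group] <- NA

df_species
```

Only the species_ columns of the row with species_na == 1 change; weight and the sex_ columns are left untouched.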

I have come up with the partial solution below, but I haven't been able to generalize the last step to the entire dataset:

factor_vars <- dplyr::select_if(df_raw, is.factor) %>% colnames()
na_labs <- paste(factor_vars,
                 "na",
                 sep = "_")

df_dummy <- df_dummy %>%
  dplyr::mutate(across(all_of(na_labs),
                       .fns = list(var = ~ . == 1),
                       .names = "{fn}_{col}" ))  

# --- trial run for one variable only
test <- df_dummy %>% 
  mutate(species_cat = ifelse(var_species_na == TRUE,
                              NA,
                              species_cat))

Any help is appreciated!


Solution

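  • You can loop over the factor names with purrr::map_dfc() and set each group of dummies to NA wherever that group's _na indicator is 1. Note that the result contains only the dummy columns, so weight is dropped: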

    library(dplyr)
    library(purrr)
    
    df_dummy <- dummies::dummy.data.frame(df_raw,
                                          dummy.classes = "factor",
                                          sep = "_",
                                          omit.constants = TRUE,
                                          all = TRUE) %>% 
      janitor::clean_names()
    
    factor_vars <- dplyr::select_if(df_raw, is.factor) %>% colnames()
    na_labs <- paste(factor_vars,
                     "na",
                     sep = "_")
    
    
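    # Use a named function for the outer loop so that the factor name is not
    # shadowed by the column inside the across() lambda.
    map_dfc(factor_vars, function(fv) {
      # rows where this factor was NA in df_raw
      is_missing <- df_dummy[[paste0(fv, "_na")]] == 1
      df_dummy %>%
        select(starts_with(paste0(fv, "_"))) %>%
        mutate(across(everything(), ~ ifelse(is_missing, NA, .x)))
    })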
    
    #  species_cat species_dog species_na sex_f sex_m sex_na
    #1           1           0          0     0     1      0
    #2           0           1          0    NA    NA     NA
    #3          NA          NA         NA     1     0      0
    #4           0           1          0     1     0      0
    #5           0           1          0     0     1      0
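If the non-dummy columns (here weight) should be kept, one alternative is to modify df_dummy in place, one prefix at a time, instead of binding selected columns back together. This is a minimal sketch, assuming a hand-typed copy of df_dummy (weights rounded) so that it runs without the dummies and janitor packages. The names df_clean and is_missing are introduced for illustration only.

```r
library(dplyr)

# Hand-typed copy of df_dummy from the question (weights rounded)
df_dummy <- data.frame(
  species_cat = c(1, 0, 0, 0, 0),
  species_dog = c(0, 1, 0, 1, 1),
  species_na  = c(0, 0, 1, 0, 0),
  weight      = c(3.03, 3.22, 5.23, 4.23, 5.82),
  sex_f       = c(0, 0, 1, 1, 0),
  sex_m       = c(1, 0, 0, 0, 1),
  sex_na      = c(0, 1, 0, 0, 0)
)
factor_vars <- c("species", "sex")

df_clean <- df_dummy
for (fv in factor_vars) {
  # Rows where this factor was missing, taken from the untouched df_dummy
  is_missing <- df_dummy[[paste0(fv, "_na")]] == 1

  # Blank out every column sharing the prefix; all other columns pass through
  df_clean <- df_clean %>%
    mutate(across(starts_with(paste0(fv, "_")), ~ replace(.x, is_missing, NA)))
}

df_clean
```

Reading the indicator from the original df_dummy rather than from df_clean keeps the flag independent of the order in which columns are overwritten. Matching with starts_with() on the prefix plus underscore is a slightly stricter choice than contains(), which would also pick up any column that merely includes the factor name somewhere in its name.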