Search code examples
rdata-cleaning

Add multiple columns (TRUE/FALSE/NA) based on (not)matching a string with multiple-choice selections


Scenario: Suvery tools often put multiple choice questions’s answers as a single value with concatenated strings: If asking which programming languages from a list people use, they will generate "C;Python;R" if they selected these options; all of these values in a single columns representing this multiple choice question – but that is hard to work with.

Example data set: Stack Overflow Survey, 2003, file survey_results_public.csv e.g. column $LanguageHaveWorkedWith

Goal: Based on the column holding the strings with the selected answers and a list of possible answers, I try to…

  • Generate a column for each of the possible answers, e.g. ProgLang_C, ProgLang_Java
  • For e.g. "C;Python;R" the column ProgLang_C is TRUE, for ProgLang_Java it is FALSE; If there is NA, all values are NA, too (and not false) (so what I want is is what str_detect does)
  • The code should work with piping the dataframe; tidyverse preferred, but not essential

resulting dataframe example (not using the example dataset’s column names for brevity)

lang lang_java lang_C lang_python lang…
"C; python" FALSE TRUE TRUE

Notes:


Solution

  • You could do this by splitting the string and reframing it in the long direction by observation number. Then, you could pivot the language variable wider filling in missing values with FALSE.

    library(dplyr)
    library(stringr)
    library(tidyr)
    dat <- data.frame(lang = c("C;python", "R;python", "R;C;python"))
    dat %>% 
      mutate(obs = row_number()) %>% 
      group_by(obs) %>% 
      reframe(lang_orig = lang, lang = c(str_split(lang, pattern=";", simplify=TRUE))) %>% 
      mutate(exist = TRUE) %>% 
      pivot_wider(names_from = lang, names_prefix = "lang_", values_from= exist, values_fill = FALSE)
    #> # A tibble: 3 × 5
    #>     obs lang_orig  lang_C lang_python lang_R
    #>   <int> <chr>      <lgl>  <lgl>       <lgl> 
    #> 1     1 C;python   TRUE   TRUE        FALSE 
    #> 2     2 R;python   FALSE  TRUE        TRUE  
    #> 3     3 R;C;python TRUE   TRUE        TRUE
    

    Created on 2024-01-17 with reprex v2.0.2