Scenario: Suvery tools often put multiple choice questions’s answers as a single value with concatenated strings: If asking which programming languages from a list people use, they will generate "C;Python;R" if they selected these options; all of these values in a single columns representing this multiple choice question – but that is hard to work with.
Example data set: Stack Overflow Survey, 2003, file survey_results_public.csv
e.g. column $LanguageHaveWorkedWith
Goal: Based on the column holding the strings with the selected answers and a list of possible answers, I try to…
ProgLang_C
, ProgLang_Java
…ProgLang_C
is TRUE, for ProgLang_Java
it is FALSE; If there is NA, all values are NA, too (and not false) (so what I want is is what str_detect does)resulting dataframe example (not using the example dataset’s column names for brevity)
lang | lang_java | lang_C | lang_python | lang… |
---|---|---|---|---|
"C; python" | FALSE | TRUE | TRUE | … |
… | … | … | … | … |
Notes:
You could do this by splitting the string and reframing it in the long direction by observation number. Then, you could pivot the language variable wider filling in missing values with FALSE
.
library(dplyr)
library(stringr)
library(tidyr)
dat <- data.frame(lang = c("C;python", "R;python", "R;C;python"))
dat %>%
mutate(obs = row_number()) %>%
group_by(obs) %>%
reframe(lang_orig = lang, lang = c(str_split(lang, pattern=";", simplify=TRUE))) %>%
mutate(exist = TRUE) %>%
pivot_wider(names_from = lang, names_prefix = "lang_", values_from= exist, values_fill = FALSE)
#> # A tibble: 3 × 5
#> obs lang_orig lang_C lang_python lang_R
#> <int> <chr> <lgl> <lgl> <lgl>
#> 1 1 C;python TRUE TRUE FALSE
#> 2 2 R;python FALSE TRUE TRUE
#> 3 3 R;C;python TRUE TRUE TRUE
Created on 2024-01-17 with reprex v2.0.2