Search code examples
rdplyrstringr

How to separate value of column to become column?


My data frame is so dirty, it put my interesting variable together.

my df like it : enter image description here

      eid                                edu
1 1009467 "A levels/AS levels or equivalent"
2 1016906 "A levels/AS levels or equivalent"
3 1018742 "A levels/AS levels or equivalent"
4 1030778 "A levels/AS levels or equivalent","CSEs or equivalent"
5 1030785 "A levels/AS levels or equivalent","CSEs or equivalent"

or you can copy it:

structure(list(n = 399:401, edu = c("\"A levels/AS levels or equivalent\",\"College or University degree\",\"CSEs or equivalent\",\"NVQ or HND or HNC or equivalent\"", 
"\"A levels/AS levels or equivalent\",\"College or University degree\",\"CSEs or equivalent\",\"NVQ or HND or HNC or equivalent\",\"O levels/GCSEs or equivalent\"", 
"\"A levels/AS levels or equivalent\",\"College or University degree\",\"CSEs or equivalent\",\"NVQ or HND or HNC or equivalent\",\"O levels/GCSEs or equivalent\""
)), row.names = c(NA, 3L), class = "data. Frame")

and it may be contain 8 education level options:

"A levels/AS levels or equivalent",
"College or University degree",
"CSEs or equivalent",
"NVQ or HND or HNC or equivalent",
"O levels/GCSEs or equivalent",
"Other professional qualifications eg: nursing, teaching",
"Prefer not to answer",
"None of the above"

The dirtiest thing is that it may appear at the same time, the first five are selected together.

I want to separate my column of edu according their value to become new variable, and if edu contain it , it will display 1, if not, it will display 0

like it:

      eid                                edu 
1 1009467 "A levels/AS levels or equivalent"           
2 1016906 "A levels/AS levels or equivalent"           
3 1018742 "A levels/AS levels or equivalent"           
4 1030778 "A levels/AS levels or equivalent","CSEs or equivalent"        
5 1043561 "A levels/AS levels or equivalent","CSEs or equivalent"       

A levels CSEs
1        0
1        0
1        0
1        1
1        1

Thank you!


Solution

  • Try this:

    library(dplyr) #>= 1.1.0
    library(tidyr)
    
    df %>% 
      separate_rows(edu, sep = ",") %>% 
      mutate(edu =  gsub("^\"|\"$", "",edu)) %>% 
      mutate(A_levels = ifelse(edu %in% "A levels/AS levels or equivalent", 1, 0),
             College_Uni = ifelse(edu %in% "College or University degree", 1, 0),
             CSEs = ifelse(edu %in% "CSEs or equivalent", 1, 0),
             NVQ_HND_HNC = ifelse(edu %in% "NVQ or HND or HNC or equivalent", 1, 0),
             O_levels_GCSEs = ifelse(edu %in% "O levels/GCSEs or equivalent", 1, 0),
             Other_prof_qual = ifelse(edu %in% "Other professional qualifications eg: nursing, teaching", 1, 0),
             Prefer_not_to_answer = ifelse(edu %in% "Prefer not to answer", 1, 0),
             None_of_the_above = ifelse(edu %in% "None of the above", 1, 0)) %>% 
      summarise(across(everything(), ~max(.)), .by = eid)
    
     eid edu                              A_levels Colleg…¹  CSEs NVQ_H…² O_lev…³ Other…⁴ Prefe…⁵ None_…⁶
        <int> <chr>                               <dbl>    <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
    1 1009467 A levels/AS levels or equivalent        1        0     0       0       0       0       0       0
    2 1016906 A levels/AS levels or equivalent        1        0     0       0       0       0       0       0
    3 1018742 A levels/AS levels or equivalent        1        0     0       0       0       0       0       0
    4 1030778 CSEs or equivalent                      1        0     1       0       0       0       0       0
    5 1030785 CSEs or equivalent                      1        0     1       0       0       0       0       0
    # … with abbreviated variable names ¹​College_Uni, ²​NVQ_HND_HNC, ³​O_levels_GCSEs, ⁴​Other_prof_qual,
    #   ⁵​Prefer_not_to_answer, ⁶​None_of_the_above