
dplyr: mutate from multiple areas/characters of a substring


A beginner here. I'm trying to use dplyr's mutate() with case_when() on a 10-character string in which each position represents an ethnicity. For example, a string with a "Y" in any of positions 1-3 and an "N" everywhere else should be classified as "Latino". What I can't work out is how to handle a string that has a "Y" in any of positions 1-3 and also a "Y" elsewhere, say in position 4 or 5 ("Asian"). I want to classify that case as "Multi-Ethnic". What is the correct code to produce "Multi-Ethnic" for such strings? Many thanks for this site!

library(dplyr)
library(stringr)  # provides str_sub(), used below

data = data.frame(APP_AC = c("YNNNNNNNNN",
                             "YYNNNNNNNN",
                             "YYYNNNNNNN",
                             "YNYNNNNNNN",
                             "NNNYNNNNNN",
                             "YNNYNNNNNN",
                             "NNNNNYNNNN",
                             "YNNNNYNNNY",
                             "NNNNNNNNNN"))

data %>% 
  mutate(ETHNICITY = case_when(
    str_sub(APP_AC,1,1) == "Y" ~ "Latino", 
    str_sub(APP_AC,2,2) == "Y" ~ "Latino",
    str_sub(APP_AC,3,3) == "Y" ~ "Latino",
    str_sub(APP_AC,4,4) == "Y" ~ "Asian",
    str_sub(APP_AC,5,5) == "Y" ~ "Asian",
    str_sub(APP_AC,6,6) == "Y" ~ "Black",
    str_sub(APP_AC,7,7) == "Y" ~ "Native_American_Alaskan",
    str_sub(APP_AC,8,8) == "Y" ~ "Pacific_Islander",
    str_sub(APP_AC,9,9) == "Y" ~ "Pacific_Islander",
    str_sub(APP_AC,10,10) == "Y" ~ "White",
    TRUE ~ "Unknown"))

Current output:

    APP_AC     ETHNICITY
1   YNNNNNNNNN Latino
2   YYNNNNNNNN Latino
3   YYYNNNNNNN Latino
4   YNYNNNNNNN Latino
5   NNNYNNNNNN Asian
6   YNNYNNNNNN Latino
7   NNNNNYNNNN Black
8   YNNNNYNNNY Latino
9   NNNNNNNNNN Unknown

Desired output:

    APP_AC     ETHNICITY
1   YNNNNNNNNN Latino
2   YYNNNNNNNN Latino
3   YYYNNNNNNN Latino
4   YNYNNNNNNN Latino
5   NNNYNNNNNN Asian
6   YNNYNNNNNN Multi-Ethnic
7   NNNNNYNNNN Black
8   YNNNNYNNNY Multi-Ethnic
9   NNNNNNNNNN Unknown

Solution

  • Updated to include logic described in comments and updated question.

    CODE

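    library(stringr)  # str_detect() and str_count() come from stringr

    data %>% 
      mutate(ETHNICITY = case_when(
        # "Y" in positions 1-3 and nowhere else: Latino only
        str_detect(substr(APP_AC, 1, 3),"Y") & str_count(substr(APP_AC, 4, 10), "Y") == 0 ~ "Latino", 
        # "Y" in positions 1-3 plus at least one "Y" in positions 4-10
        str_detect(substr(APP_AC, 1, 3),"Y") & str_count(substr(APP_AC, 4, 10), "Y") >= 1 ~ "Multi-Ethnic", 
        # case_when() returns the first match, so these rows have no "Y" in positions 1-3
        str_detect(substr(APP_AC, 4, 5),"Y")  ~ "Asian", 
        str_detect(substr(APP_AC, 6, 6),"Y") ~ "Black",
        str_detect(substr(APP_AC, 7, 7),"Y") ~ "Native_American_Alaskan",
        str_detect(substr(APP_AC, 8, 9),"Y") ~ "Pacific_Islander",
        str_detect(substr(APP_AC, 10, 10),"Y") ~ "White",
        TRUE ~ "Unknown")
      )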
    

    OUTPUT

          APP_AC    ETHNICITY
    1 YNNNNNNNNN       Latino
    2 YYNNNNNNNN       Latino
    3 YYYNNNNNNN       Latino
    4 YNYNNNNNNN       Latino
    5 NNNYNNNNNN        Asian
    6 YNNYNNNNNN Multi-Ethnic
    7 NNNNNYNNNN        Black
    8 YNNNNYNNNY Multi-Ethnic
    9 NNNNNNNNNN      Unknown
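
The code above only returns "Multi-Ethnic" when one of the flagged groups is Latino (positions 1-3). A string such as "NNNYNYNNNN" (Asian and Black) would still come out as "Asian", because case_when() stops at the first matching condition. If any combination of two or more groups should count as "Multi-Ethnic", which is an assumption that goes beyond what the question states, one possible sketch is to count the flagged groups first. The helper has_y() and the column N_GROUPS are hypothetical names introduced for this illustration, and the sample strings are made up.

```r
library(dplyr)
library(stringr)

# Small illustrative sample (not the question's data)
data <- data.frame(APP_AC = c("YNNNNNNNNN", "YNNYNNNNNN", "NNNYNYNNNN",
                              "NNNNNNNYYN", "NNNNNNNNNN"))

# Hypothetical helper: TRUE where any character in positions
# first..last of x is "Y"
has_y <- function(x, first, last) str_detect(str_sub(x, first, last), "Y")

result <- data %>%
  mutate(
    # Number of distinct groups flagged per string (logicals add as 0/1)
    N_GROUPS = has_y(APP_AC, 1, 3) + has_y(APP_AC, 4, 5) +
               has_y(APP_AC, 6, 6) + has_y(APP_AC, 7, 7) +
               has_y(APP_AC, 8, 9) + has_y(APP_AC, 10, 10),
    ETHNICITY = case_when(
      N_GROUPS == 0       ~ "Unknown",
      N_GROUPS > 1        ~ "Multi-Ethnic",
      # From here on exactly one group is flagged
      has_y(APP_AC, 1, 3) ~ "Latino",
      has_y(APP_AC, 4, 5) ~ "Asian",
      has_y(APP_AC, 6, 6) ~ "Black",
      has_y(APP_AC, 7, 7) ~ "Native_American_Alaskan",
      has_y(APP_AC, 8, 9) ~ "Pacific_Islander",
      TRUE                ~ "White"
    )
  )

print(result)
```

With this approach "NNNNNNNYYN" stays "Pacific_Islander", since positions 8 and 9 belong to the same group, while "NNNYNYNNNN" becomes "Multi-Ethnic".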