r, stringr

Detecting the presence of a set of alphanumeric codes in a data frame


I have a data frame (named fy14y) with 100 variables (c1 - c100) containing alphanumeric codes of varying length (e.g. 1-S023, 2-Y0408).

What would be the best way to create a binary variable that detects if one or more of the codes from the following list is present in a row?

1-A400, 1-A401, 1-A402, 1-A410, 1-A415, 1-A4152, 1-A4158, 1-B377, 1-P360, 1-P362, 1-P364, 1-U900

That is, 0 if none of them appear in the row, and 1 if one or more of them appear, whether once or multiple times.

I've played around with the stringr str_detect function without much luck.

Thanks!


Solution

  • I think an intuitive way to do this is to put the data in long form, search for a match by id and then put it back in wide form. We can do this easily with tidyr and dplyr.

    Using the df and codes_to_check from the answer by Ronak Shah:

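    library(dplyr)  # .by in mutate() requires dplyr >= 1.1.0
    library(tidyr)

    df |>
        pivot_longer(-id) |>    # one row per id/column pair
        mutate(
            # 1 if any listed code appears among this id's values, else 0
            row_match = +(any(codes_to_check %in% value)), .by = id
        ) |>
        pivot_wider()           # back to one row per id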
    
    # # A tibble: 4 × 4
    #      id row_match c1     c2    
    #   <int>     <int> <chr>  <chr> 
    # 1     1         1 1-A400 a     
    # 2     2         0 a      b     
    # 3     3         1 b      1-A401
    # 4     4         0 c      d     
    

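    This should be faster than iterating over rows, since the check is vectorised within each id group instead of running in an explicit row-by-row loop.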
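
The referenced `df` and `codes_to_check` are not shown here, so the following is a minimal self-contained sketch under stated assumptions: `df` is a hypothetical stand-in reconstructed from the printed output above, and `codes_to_check` is taken from the list in the question. It runs the long-form approach and, for comparison, a base R alternative (not part of the original answer) that skips the reshape. The names `long_result`, `wide_result`, `code_cols` and `hit` are illustrative only.

```r
library(dplyr)  # mutate(.by = ) needs dplyr >= 1.1.0
library(tidyr)

# Hypothetical stand-in data matching the printed output above;
# the real data frame (fy14y) would have columns c1 to c100.
df <- tibble(
  id = 1:4,
  c1 = c("1-A400", "a", "b", "c"),
  c2 = c("a", "b", "1-A401", "d")
)

# Codes listed in the question
codes_to_check <- c(
  "1-A400", "1-A401", "1-A402", "1-A410", "1-A415", "1-A4152",
  "1-A4158", "1-B377", "1-P360", "1-P362", "1-P364", "1-U900"
)

# Long-form approach: reshape, flag each id, reshape back
long_result <- df |>
  pivot_longer(-id) |>
  mutate(row_match = +any(value %in% codes_to_check), .by = id) |>
  pivot_wider()

# Alternative without reshaping: test every code column against the list,
# then OR the per-column results together element-wise (one value per row)
code_cols <- setdiff(names(df), "id")
hit <- Reduce(`|`, lapply(df[code_cols], `%in%`, codes_to_check))
wide_result <- df |> mutate(row_match = +hit)
```

Note that `%in%` compares whole values, so a cell containing 1-A4152 counts only because 1-A4152 is itself in the list, not because it starts with 1-A415. `str_detect` matches substrings by default, which may be one reason it was awkward to use for this task.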