Create indicator variable to detect string in any column that begins with word in R


I have a dataset with many columns. I am interested in the columns whose names contain "dx_". I would like to create an indicator variable that is 1 in every row where at least one of those columns has a value that starts with "493". For example:

df = data.frame(
  var1 = c(1, 2, 3, 4, 5),
  var2 = c(5, 4, 3, 2, 1),
  dx_1 = c("493", "XH", "1493", "4938B", "LP23"),
  dx_2 = c("AB", "0PC3", "MNP", "12GT", "FPN2"),
  a_dx_3 = c("FTR", "2RTN", "92KS", "J294", "493V")
)

> df
  var1 var2  dx_1 dx_2 a_dx_3
1    1    5   493   AB    FTR
2    2    4    XH 0PC3   2RTN
3    3    3  1493  MNP   92KS
4    4    2 4938B 12GT   J294
5    5    1  LP23 FPN2   493V

I would like to create a new variable, Z, that is 1 if any of dx_1, dx_2, or a_dx_3 has a value that starts with "493" in that row, and 0 otherwise. However, I need the solution to be flexible, so that I don't have to specify the columns beyond saying contains("dx_").

I would like my answer to look like this:

  var1 var2  dx_1 dx_2 a_dx_3 Z
1    1    5   493   AB    FTR 1
2    2    4    XH 0PC3   2RTN 0
3    3    3  1493  MNP   92KS 0
4    4    2 4938B 12GT   J294 1
5    5    1  LP23 FPN2   493V 1

This is my failed attempt: First I create a helper function to recognize the string:

detect_493_fn <- function(str){
  ans = if_else(str_starts(str,"493") == TRUE,
                1,
                0)
  return(ans)
} 

And then use a combination of if_any, across, and contains:

ans <- df %>%
  mutate(Z = case_when(
    if_any(across(contains("dx_"), ~detect_493_fn(.))) ~ 1,
    TRUE ~ 0))

but I get this error:

Error in `mutate()`:
! Problem while computing `Z = case_when(...)`.
Caused by error in `if_any()`:
! Must subset columns with a valid subscript vector.
x Subscript has the wrong type `tbl_df<
  dx_1  : double
  dx_2  : double
  a_dx_3: double
>`.
i It must be numeric or character.

I would be so grateful if someone could help me. Thanks!


Solution

  • You can do:

    library(dplyr)
    library(stringr)
    
    df %>%
      mutate(Z = as.numeric(if_any(contains("dx_"), ~ str_starts(.x, "493"))))
    
      var1 var2  dx_1 dx_2 a_dx_3 Z
    1    1    5   493   AB    FTR 1
    2    2    4    XH 0PC3   2RTN 0
    3    3    3  1493  MNP   92KS 0
    4    4    2 4938B 12GT   J294 1
    5    5    1  LP23 FPN2   493V 1
    

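    Consider keeping your Z variable as logical. if_any() is a variant of across(), so you use it in place of across(), not with it. Its first argument expects a column selection such as contains("dx_"); wrapping that selection in across() passes a data frame instead, which is what triggers the "Must subset columns with a valid subscript vector" error. The function you apply should also return TRUE/FALSE rather than 1/0, since if_any() combines the per-column results as logical values.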
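
As a minimal sketch of the logical version, the helper from the question can be reduced to a predicate and passed straight to if_any(). The name starts_with_493 is hypothetical and only used for illustration; the data frame is the one from the question.

```r
library(dplyr)
library(stringr)

# Example data from the question
df <- data.frame(
  var1 = c(1, 2, 3, 4, 5),
  var2 = c(5, 4, 3, 2, 1),
  dx_1 = c("493", "XH", "1493", "4938B", "LP23"),
  dx_2 = c("AB", "0PC3", "MNP", "12GT", "FPN2"),
  a_dx_3 = c("FTR", "2RTN", "92KS", "J294", "493V")
)

# Hypothetical helper: returns TRUE/FALSE instead of 1/0
starts_with_493 <- function(x) str_starts(x, "493")

# if_any() takes the column selection directly; no across() wrapper
res <- df %>%
  mutate(Z = if_any(contains("dx_"), starts_with_493))

res$Z
# TRUE FALSE FALSE TRUE TRUE
```

If a 0/1 column is needed later, for example for export, as.integer(res$Z) converts it without changing the logic.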