How to use startsWith and str_length simultaneously with multiple prefixes in R


I would like to use startsWith and str_length to identify the entries in yrMonth_ds$DX1 that start with one of the strings in dx9 and have a length greater than or equal to 3. This is what I've tried, but it returns a data frame with zero rows. I would like it to return a data frame with the 1st, 4th and 5th rows of the original data frame:

dx9 = c(as.character(8:10))
DX1 <- c("8001","7","80","992","1010","93","400")
ind <- c(0,1,1,1,0,0,1)
yrMonth_ds = as.data.frame(cbind(DX1,ind))
yrMonth_ds$DX1 <- as.character(yrMonth_ds$DX1)
yrMonth_ds_endpt <- yrMonth_ds[which(startsWith(yrMonth_ds$DX1,paste0(dx9,collapse="|")) & str_length(yrMonth_ds$DX1 > 3)),]
yrMonth_ds_endpt

I would really appreciate any help. Thanks!


Solution

  • One option is to check the number of characters with nchar and combine that condition with a grepl test on 'DX1'. For the grepl pattern, paste the elements of 'dx9' into a single alternation string and prefix it with ^ so that it only matches at the start of the string. subset then returns the rows that satisfy both conditions:

    subset(yrMonth_ds, nchar(DX1) >= 3 &
             grepl(paste0("^(", paste(dx9, collapse = "|"), ")"), DX1))
    #    DX1 ind
    #1 8001   0
    #4  992   1
    #5 1010   0
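
For reference, the original attempt fails for two reasons: startsWith compares its prefix literally, so "8|9|10" is never read as a regex alternation, and the `> 3` sits inside the str_length call, so the length of the comparison result is measured instead of the length being compared. If you would rather keep startsWith, a minimal sketch is below. It rebuilds the example data with data.frame, and the helper names has_prefix and long_enough are only illustrative, not part of the original code:

```r
dx9 <- as.character(8:10)
yrMonth_ds <- data.frame(
  DX1 = c("8001", "7", "80", "992", "1010", "93", "400"),
  ind = c(0, 1, 1, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# startsWith() does a literal comparison, so test each prefix on its own
# and OR the resulting logical vectors together.
has_prefix <- Reduce(`|`, lapply(dx9, function(p) startsWith(yrMonth_ds$DX1, p)))

# The comparison belongs outside nchar(), not inside it.
long_enough <- nchar(yrMonth_ds$DX1) >= 3

# Keep only the rows that satisfy both conditions (rows 1, 4 and 5 here).
yrMonth_ds_endpt <- yrMonth_ds[has_prefix & long_enough, ]
yrMonth_ds_endpt
```

This uses only base R, so stringr does not need to be loaded. The grepl version above is shorter; this one avoids regular expressions entirely, which may matter if the prefixes could ever contain regex metacharacters.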