
Using stringr::str_detect to detect whether a string appears after a character has appeared 4 times


I'm not sure I worded my question all that well, but the title is essentially what I am trying to do.

Data example:

Data <- c("NELIG_Q1_1_C1_A", "NELIG_N1_1_EG1_B", "NELIG_V2_1_NTH_C", "NELIG_Q2_1_C5_Q",
"NELIG_N1_1_C1_RA", "NELIG_Q1_1_EG1_QR", "NELIG_V2_1_NTH_PQ", "NELIG_N2_1_C5_PRQ")

I want to filter using str_detect on the last set of letter combinations. There will always be four "_" before the string/pattern I am looking for, but after the fourth "_" there could be many different letter combinations. In the above example I am trying to detect only the letter "Q".

If I do a simple Data2 <- Data %>% filter(str_detect(column, "Q")) I would get all rows that have Q anywhere in the string. How can I tell it to focus on the last section only?


Solution

  • If the aim is to detect/match those strings that contain Q in the 'section' after the last _, then this works:

    grep("_[A-Z]*Q[A-Z]*$", Data, value = TRUE, perl = TRUE)
    [1] "NELIG_Q2_1_C5_Q"   "NELIG_Q1_1_EG1_QR" "NELIG_V2_1_NTH_PQ" "NELIG_N2_1_C5_PRQ"
    

    or, with str_detect:

    library(stringr)
    str_detect(Data, "_[A-Z]*Q[A-Z]*$")
    [1] FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
    

    Data:

    Data <- c("NELIG_Q1_1_C1_A", "NELIG_N1_1_EG1_B", "NELIG_V2_1_NTH_C", "NELIG_Q2_1_C5_Q",
              "NELIG_N1_1_C1_RA", "NELIG_Q1_1_EG1_QR", "NELIG_V2_1_NTH_PQ", "NELIG_N2_1_C5_PRQ")
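
To plug this into the filter() call from the question, something like the sketch below might work. It assumes the strings live in a data frame column named column (the placeholder name used in the question) and that dplyr is loaded; the data frame df is made up here for illustration. It also swaps in a stricter variant of the pattern that counts the four underscores explicitly, which is an assumption about what is wanted rather than part of the answer above.

```r
library(stringr)
library(dplyr)

Data <- c("NELIG_Q1_1_C1_A", "NELIG_N1_1_EG1_B", "NELIG_V2_1_NTH_C", "NELIG_Q2_1_C5_Q",
          "NELIG_N1_1_C1_RA", "NELIG_Q1_1_EG1_QR", "NELIG_V2_1_NTH_PQ", "NELIG_N2_1_C5_PRQ")

# Hypothetical data frame; `column` is the placeholder name from the question.
df <- data.frame(column = Data, stringsAsFactors = FALSE)

# Stricter variant (an assumption, not the pattern from the answer):
#   ^(?:[^_]*_){4}  exactly four "field_" chunks from the start of the string
#   [^_]*Q[^_]*$    a final field, free of underscores, that contains Q
strict_pattern <- "^(?:[^_]*_){4}[^_]*Q[^_]*$"

# Keep only rows whose fifth (last) field contains Q.
Data2 <- df %>% filter(str_detect(column, strict_pattern))
Data2$column
```

On the example data this keeps the same four strings as the pattern in the answer. The two differ only on other inputs: the stricter version rejects strings that do not have exactly four underscores, and it allows any non-underscore characters (not just uppercase letters) around the Q in the last field.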