matching strings regex exact match - special characters


Following on from a solved thread here: matching strings regex exact match (with a big thank-you to @Onyambu for the updated code).

I need to match strings exactly - even if there are special characters.

Note - apologies, this is the third question on this issue. I am nearly there, but now I don't know how to handle special characters, and I am still upskilling on manipulating strings in R.

UPDATED FOR CLARITY:

I have a table of match words / strings like this:

codes <- structure(
  list(
    column1 = structure(
      c(2L, 3L, NA),
      .Label = c("",
                 "4+", "4 +"),
      class = "factor"
    ),
    column2 = structure(
      c(1L,
        3L, 2L),
      .Label = c("old", "the money", "work"),
      class = "factor"
    ),
    column3 = structure(
      c(3L, 2L, NA),
      .Label = c("", "wonderyears",
                 "woke"),
      class = "factor"
    )
  ),
  row.names = c(NA,-3L),
  class = "data.frame"
)

And a dataset that has a column of strings. I want to see if any of the codes are included in each of the records in strings:

strings <- structure(
  list(
    SurveyID = structure(
      1:4,
      .Label = c("ID_1", "ID_2",
                 "ID_3", "ID_4"),
      class = "factor"
    ),
    Open_comments = structure(
      c(2L,
        4L, 3L, 1L),
      .Label = c(
        "I need to pick up some apples",
        "The system works",
        "Flag only if there is a 4 with a plus",
        "Show me the money"
      ),
      class = "factor"
    )
  ),
  class = "data.frame",
  row.names = c(NA,-4L)
)

I am currently matching the codes to the strings using the following code:

strings[names(codes)] <- lapply(codes, function(x) 
  +(grepl(paste0("\\b", na.omit(x), "\\b", collapse = "|"), strings$Open_comments)))

Output:

  SurveyID                         Open_comments column1 column2 column3
1     ID_1                      The system works       0       0       0
2     ID_2                     Show me the money       0       1       0
3     ID_3 Flag only if there is a 4 with a plus       1       0       0
4     ID_4         I need to pick up some apples       0       0       0

Issue - Row 3 (ID_3): I only want to flag this row if the string includes "4+" or "4 +", but it is being flagged anyway. Is there any way to match these codes exactly?


Solution

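  • In a regex, an unescaped + is a quantifier meaning "one or more of the preceding character", so the pattern 4+ also matches a bare 4. We can escape the + so that it is evaluated literally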

    +(grepl(paste0( "(", gsub("\\+", "\\\\+", na.omit(codes$column1)), ")",
         collapse="|"), strings$Open_comments))
    #[1] 0 0 0 0
    

    If we use a string that actually contains 4+, it is picked up

    +(grepl(paste0( "(", gsub("\\+", "\\\\+", na.omit(codes$column1)), ")",
         collapse="|"), "Flag only if there is a 4+ with a plus"))
    #[1] 1
    

    And for multiple columns

    sapply(codes, function(x)+(grepl(paste0( "\\b(", 
          gsub("\\+", "\\\\+", na.omit(x)), ")\\b",
          collapse="|"), strings$Open_comments)))
    #     column1 column2 column3
    #[1,]       0       0       0
    #[2,]       0       1       0
    #[3,]       0       0       0
    #[4,]       0       0       0
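
One caveat with the multi-column version: `\\b` only matches between a word character and a non-word character, so a code ending in `+` is found only when a letter, digit or underscore follows it directly. A comment such as "Rated 4+ overall" would therefore not be flagged. Below is a minimal sketch of one possible workaround, assuming PCRE lookarounds (`perl = TRUE`) are acceptable. The helper names `escape_regex` and `flag_codes`, and the extra test comment "Rated 4+ overall", are hypothetical and not part of the original thread.

```r
# Simplified copies of the data from the question (plain character vectors
# instead of factors), plus one extra, hypothetical comment containing "4+".
codes <- data.frame(
  column1 = c("4+", "4 +", NA),
  column2 = c("old", "work", "the money"),
  column3 = c("woke", "wonderyears", NA),
  stringsAsFactors = FALSE
)
comments <- c(
  "The system works",
  "Show me the money",
  "Flag only if there is a 4 with a plus",
  "I need to pick up some apples",
  "Rated 4+ overall"            # hypothetical, added for illustration
)

# Hypothetical helper: put a backslash in front of every regex metacharacter,
# so that each code is treated as literal text (not only "+").
escape_regex <- function(x) {
  gsub("([\\\\.|()\\[\\]{}^$*+?])", "\\\\\\1", as.character(x), perl = TRUE)
}

# Hypothetical helper: flag each text that contains any of the codes as a
# whole token. The lookarounds require "no word character on either side",
# which also works when the code starts or ends with a symbol such as "+".
flag_codes <- function(codes_col, text) {
  keys <- escape_regex(na.omit(codes_col))
  pattern <- paste0("(?<!\\w)(?:", paste(keys, collapse = "|"), ")(?!\\w)")
  +grepl(pattern, text, perl = TRUE)
}

# The \\b-wrapped pattern misses "4+" when a space follows the "+"
old_hit <- grepl("\\b(4\\+)\\b", "Rated 4+ overall")   # FALSE

# One 0/1 column per column of codes
result <- sapply(codes, flag_codes, text = comments)
result
```

On the four original comments this should give the same flags as the `sapply()` call above, and the extra fifth comment is flagged in column1 only. If whole-token matching is not needed at all, `grepl("4+", x, fixed = TRUE)` is a simpler way to search for a single literal string.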