Detecting text matching specific regex in dplyr


I hope all is well. I need to find which row has:

  1. only the term pro, with a space before it and no characters directly after it, or
  2. the term certificat with a space before it; characters may follow, as in certificate or certification, or
  3. any digit (could be one digit or more)

A sample of the data:

df_new <- data.frame(
  given_info=c('SA12 is given','he is Pro writer',
               'she programmed','why not having an ra31',
               'his bag missing', 'pa12 and certificate are given',
               'schedule is ready','certification was awarded',
               'meeting is canceled'))

df_new %>% select(given_info)
                      given_info
1                  SA12 is given
2               he is Pro writer
3                 she programmed
4         why not having an ra31
5                his bag missing
6 pa12 and certificate are given
7              schedule is ready
8      certification was awarded
9            meeting is canceled

The desired output would be:

                      given_info string_detected
1                  SA12 is given               1
2               he is Pro writer               1
3                 she programmed               0
4         why not having an ra31               1
5                his bag missing               0
6 pa12 and certificate are given               1
7              schedule is ready               0
8      certification was awarded               1
9            meeting is canceled               0


Solution

  • Something like this:

    • (^|\\s)[Pp]ro(\\s|$) ... matches "Pro" or "pro" as a standalone word: preceded by whitespace or the start of the string, and followed by whitespace or the end of the string.
    • (^|\\s)[Cc]ertificat(e|ion)?(\\s|$) ... matches "[Cc]ertificat", "[Cc]ertificate", or "[Cc]ertification" under the same conditions (whitespace or string start before, whitespace or string end after). Other suffixes, such as "certificates", are not matched.
    • \\d+ ... matches any sequence of one or more digits.
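    library(dplyr)
    library(stringr)
    
    # Flag a row with 1 if any of the three patterns matches, otherwise 0
    df_new %>%
      mutate(string_detected = as.integer(
        str_detect(given_info, "(^|\\s)[Pp]ro(\\s|$)") |                 # standalone "pro"
          str_detect(given_info, "(^|\\s)[Cc]ertificat(e|ion)?(\\s|$)") |  # certificat, certificate, certification
          str_detect(given_info, "\\d+")))                                 # one or more digits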
    
                          given_info string_detected
    1                  SA12 is given               1
    2               he is Pro writer               1
    3                 she programmed               0
    4         why not having an ra31               1
    5                his bag missing               0
    6 pa12 and certificate are given               1
    7              schedule is ready               0
    8      certification was awarded               1
    9            meeting is canceled               0
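
As a variation on the approach above, the three checks could be folded into one case-insensitive pattern. This is only a minimal sketch, not the original answer's method. It assumes that a word boundary (\\b) is an acceptable stand-in for "whitespace or start/end of the string", and that any suffix after certificat should count. The names pattern and result are illustrative.

```r
library(dplyr)
library(stringr)

df_new <- data.frame(
  given_info = c('SA12 is given', 'he is Pro writer',
                 'she programmed', 'why not having an ra31',
                 'his bag missing', 'pa12 and certificate are given',
                 'schedule is ready', 'certification was awarded',
                 'meeting is canceled'))

# Assumption: \\b (word boundary) replaces the explicit (^|\\s) and (\\s|$) checks.
#   \\bpro\\b          whole word "pro", any letter case
#   \\bcertificat\\w*  any word that starts with "certificat"
#   \\d                at least one digit anywhere in the string
pattern <- regex("\\bpro\\b|\\bcertificat\\w*|\\d", ignore_case = TRUE)

result <- df_new %>%
  mutate(string_detected = as.integer(str_detect(given_info, pattern)))

print(result)
```

Unlike the original patterns, \\b also treats punctuation as a boundary, so strings such as "pro." or "pro-rata" would be flagged as well. If that is not wanted, the (^|\\s) ... (\\s|$) form is the stricter choice.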