Tags: r, regex, string-matching

R regex match whole word taking punctuation into account


I'm working in R. I want to match whole words in text, taking punctuation into account, so that a word joined to others by hyphens does not count as a match. Example:

to_match = c('eye','nose')
text1 = 'blah blahblah eye-to-eye blah'
text2 = 'blah blahblah eye blah'

I would like eye to be matched in text2 but not in text1.

That is, the command:

to_match[sapply(paste0('\\<',to_match,'\\>'),grepl,text1)]

should return character(0), but it currently returns eye.

I also tried '\\b' instead of '\\<' and '\\>', with no success.


Solution

  • Use 

    to_match[sapply(paste0('(?:\\s|^)',to_match,'(?:\\s|$)'),grepl,text1)]
    

    The point is that a word boundary matches between a word character and a non-word character, and a hyphen is a non-word character. That is why eye matched in eye-to-eye. You want a match only when the word is delimited by whitespace or by the start or end of the string.

    With the default TRE regex engine, this is best done with groups, because TRE does not support lookarounds. Consuming the surrounding whitespace is harmless here, since you only need to test whether the string contains a match (TRUE or FALSE).

    The (?:\s|^) noncapturing group matches any whitespace or start of string and (?:\s|$) matches whitespace or end of string.
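To see the behaviour on both sample strings, here is a minimal sketch that wraps the pattern in a helper. The names whole_word_hits and use_pcre are hypothetical, introduced only for illustration. The use_pcre = TRUE branch is an alternative not taken from the answer above: it assumes you are willing to pass perl = TRUE to grepl, in which case lookarounds become available and the same test can be written without consuming the surrounding whitespace. Both branches assume the words contain no regex metacharacters.

```r
to_match <- c('eye', 'nose')
text1 <- 'blah blahblah eye-to-eye blah'
text2 <- 'blah blahblah eye blah'

# Hypothetical helper: returns the entries of `words` that occur in `text`
# delimited by whitespace or by the start/end of the string.
whole_word_hits <- function(words, text, use_pcre = FALSE) {
  patterns <- if (use_pcre) {
    # PCRE variant (assumption): "not preceded / not followed by a
    # non-whitespace character", expressed with lookarounds.
    paste0('(?<!\\S)', words, '(?!\\S)')
  } else {
    # TRE variant from the answer: non-capturing groups.
    paste0('(?:\\s|^)', words, '(?:\\s|$)')
  }
  hits <- vapply(patterns,
                 function(p) grepl(p, text, perl = use_pcre),
                 logical(1))
  words[hits]
}

whole_word_hits(to_match, text1)                   # character(0)
whole_word_hits(to_match, text2)                   # "eye"
whole_word_hits(to_match, text2, use_pcre = TRUE)  # "eye"
```

If the words may contain characters such as . or +, they would need escaping before being pasted into the pattern; that is left out of this sketch.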