Using a regex predefined class with an exception in R


I am trying to split a string on every run of punctuation and whitespace characters (hence the + sign), except for "#" and "/", because I don't want #n/a to be split, which is what currently happens. I have searched a lot but can't find a solution. Any suggestions?

t<-"[[:punct:][:space:]]+" 
bh <- tolower(strsplit(as.character(a), t)[[1]])

I have also tried storing the following in t, but it gives an error as well:

t<-"[!"\$%&'()*+,\-.:;<=>?@\[\\\]^_`{|}~\\ ]+"

Error: unexpected input in "t<-"[!"\"

One alternative is to substitute #n/a before splitting, but I want to know how to do it without having to do that.


Solution

  • You may use a PCRE regex with a lookahead that will restrict the bracket expression pattern:

    t <- "(?:(?![#/])[[:punct:][:space:]])+"
    bh <- tolower(strsplit(as.character(a), t, perl=TRUE)[[1]])
    

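    The (?:(?![#/])[[:punct:][:space:]])+ pattern matches one or more consecutive punctuation or whitespace characters, none of which is # or /. The negative lookahead (?![#/]) is checked before each character is consumed. perl=TRUE is required because the default TRE engine does not support lookaheads.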

    If you prefer to spell out the characters to match inside a bracket expression, you can fix your other pattern like this:

    t <- "[][!\"$%&'()*+,.:;<=>?@\\\\^_`{|}~ -]+"
    

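    Note that ] must come right after the opening [, a [ inside the expression does not need to be escaped, - can be put unescaped at the end, and a \ should be written with 4 backslashes. $ does not have to be escaped. The " must be escaped as \" inside the R string literal; in your attempt the unescaped " ended the string early, which is what caused the "unexpected input" error.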
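
For illustration, here is a minimal sketch that runs both patterns on a sample string. The input a is made up for this example (the question does not show the real value), and the names t1, t2, bh1 and bh2 are only used here to keep the two variants apart.

```r
# Hypothetical sample input; the question does not show the actual value of `a`.
a <- "Hello, World! Value: #n/a; done."

# Variant 1: PCRE lookahead. Any punctuation or whitespace, but never "#" or "/".
t1 <- "(?:(?![#/])[[:punct:][:space:]])+"
bh1 <- tolower(strsplit(as.character(a), t1, perl = TRUE)[[1]])

# Variant 2: explicit bracket expression (default TRE engine), with "#" and "/" left out.
# Note it lists only the space character, not tabs or newlines.
t2 <- "[][!\"$%&'()*+,.:;<=>?@\\\\^_`{|}~ -]+"
bh2 <- tolower(strsplit(as.character(a), t2)[[1]])

print(bh1)
print(bh2)
```

On this input both variants should keep #n/a as a single token. They are not fully equivalent in general: the explicit list only splits on the characters it names, so if the real data contains tabs or newlines, the [[:space:]] based pattern is probably the safer choice.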