I am trying to split my string on every run of punctuation and whitespace (hence the + sign), except on "#" and "/", because I don't want #n/a to be split, which is what currently happens. I have searched a lot for this problem but can't find a solution. Any suggestions?
t<-"[[:punct:][:space:]]+"
bh <- tolower(strsplit(as.character(a), t)[[1]])
I have also tried storing the following in t, but it also gives an error:
t<-"[!"\$%&'()*+,\-.:;<=>?@\[\\\]^_`{|}~\\ ]+"
Error: unexpected input in "t<-"[!"\"
One alternative is to substitute #n/a before splitting, but I want to know how to do it without having to do that.
You may use a PCRE regex with a lookahead that will restrict the bracket expression pattern:
t <- "(?:(?![#/])[[:punct:][:space:]])+"
bh <- tolower(strsplit(as.character(a), t, perl=TRUE)[[1]])
The (?:(?![#/])[[:punct:][:space:]])+ pattern matches 1 or more repetitions of any punctuation or whitespace char that is not a # or / char.
See the regex demo.
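As a quick check of the lookahead approach, here is a minimal sketch. The sample string in a is made up for illustration (the question does not show the real input); the variable names t and bh are the ones used above.

```r
# Hypothetical sample input; the question does not show the real data.
a <- "Hello, world! Value is #n/a; done."

# (?![#/]) rejects # and / before the bracket expression consumes a char,
# so #n/a is never matched as a delimiter. Lookaheads require perl = TRUE.
t <- "(?:(?![#/])[[:punct:][:space:]])+"

bh <- tolower(strsplit(as.character(a), t, perl = TRUE)[[1]])
print(bh)
# [1] "hello" "world" "value" "is"    "#n/a"  "done"
```

Without perl=TRUE the default TRE engine is used, which does not support lookaheads, so the flag is needed for this pattern.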
If you want to spell out the symbols to match inside a bracket expression, you may fix your other pattern like this:
t <- "[][!\"$%&'()*+,.:;<=>?@\\\\^_`{|}~ -]+"
Note that ] must come right after the opening [, a [ inside the expression does not need to be escaped, - can be put unescaped at the end, and a \ should be defined with 4 backslashes. $ does not have to be escaped.
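The explicit bracket expression can be tried out the same way with the default regex engine (no perl=TRUE). This is only a sketch: the input string is an assumption chosen to include a backslash, an underscore, a hyphen and parentheses alongside #n/a.

```r
# Hypothetical input: contains _, a literal backslash, spaces, ( ) and -.
a <- "foo_bar\\baz #n/a (x-y)"

# ] first, - last, backslash written with 4 backslashes in the R string.
# Neither # nor / is listed, so #n/a stays in one piece.
t <- "[][!\"$%&'()*+,.:;<=>?@\\\\^_`{|}~ -]+"

bh <- tolower(strsplit(as.character(a), t)[[1]])
print(bh)
```

Unlike [[:space:]], this bracket expression only lists a plain space, so tabs and newlines would not act as delimiters unless they are added to it.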