Say I have the following string:
params <- "var1 /* first, variable */, var2, var3 /* third, variable */"
I want to split it using , as a separator, then extract the "quoted substrings", so that I get 2 vectors as follows:
params_clean <- c("var1","var2","var3")
params_def <- c("first, variable","","third, variable") # note the empty string as a second element.
I use the term "quoted" in a wide sense, with arbitrary strings, here /* and */, which protect substrings from being split.
I found a workaround based on read.table and the fact that it allows quoted elements:
library(magrittr)
params %>%
gsub("/\\*","_temp_sep_ '",.) %>%
gsub("\\*/","'",.) %>%
read.table(text=.,stringsAsFactors=FALSE,sep=",") %>%
unlist %>%
unname %>%
strsplit("_temp_sep_") %>%
lapply(trimws) %>%
lapply(`length<-`,2) %>%
do.call(rbind,.) %>%
inset(is.na(.),value="")
But it's quite ugly and hackish. What's a simpler way? I'm thinking there must be a regex to feed to strsplit for this situation.
related to this question
You may use
library(stringr)
cmnt_rx <- "(\\w+)\\s*(/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/)?"
res <- str_match_all(params, cmnt_rx)
params_clean <- res[[1]][,2]
params_clean
## => [1] "var1" "var2" "var3"
params_def <- gsub("^/[*]\\s*|\\s*[*]/$", "", res[[1]][,3])
params_def[is.na(params_def)] <- ""
params_def
## => [1] "first, variable" "" "third, variable"
The main regex details (it is actually (\w+)\s*(COMMENTS_REGEX)?):
(\w+) - Capturing group 1: one or more word chars
\s* - 0+ whitespace chars
( - Capturing group 2 start
/\* - match the comment start /*
[^*]*\*+ - match 0+ characters other than * followed with 1+ literal *
(?:[^/*][^*]*\*+)* - 0+ sequences of:
    [^/*][^*]*\*+ - not a / or * (matched with [^/*]) followed with 0+ non-asterisk characters ([^*]*) followed with 1+ asterisks (\*+)
/ - closing /
)? - Capturing group 2 end, repeat 1 or 0 times (it means it is optional).
See the regex demo.
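To see what the comment part of the pattern does on its own, here is a small base R sketch. The test string x and the name comment_rx are made up for illustration, and it uses regmatches/gregexpr instead of stringr. It suggests that lone asterisks inside a comment do not end the match early:

```r
# Comment sub-pattern copied from cmnt_rx; x is an invented test string.
comment_rx <- "/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/"
x <- "a /* x * y ** z */ b /* c */"

# perl = TRUE selects the PCRE engine, where (?:...) is a non-capturing group.
found <- regmatches(x, gregexpr(comment_rx, x, perl = TRUE))[[1]]
print(found)
```

Each comment comes back as a single match, from its /* up to the first */ that follows it.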
The "^/[*]\\s*|\\s*[*]/$" pattern in gsub removes /* and */ with adjoining spaces.
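A quick way to check that cleanup step in isolation (raw_def is a hypothetical vector standing in for the group 2 captures, with NA where no comment was found):

```r
# Hypothetical raw captures; NA marks an element without a comment.
raw_def <- c("/* first, variable */", NA, "/*third, variable*/")

# Strip the leading "/*" and trailing "*/" together with adjoining whitespace.
stripped <- gsub("^/[*]\\s*|\\s*[*]/$", "", raw_def)
print(stripped)
```

Note that gsub leaves the NA element as NA, so the missing definitions still need separate handling.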
The params_def[is.na(params_def)] <- "" part replaces NA with empty strings.
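Since the question asks for something to feed to strsplit, here is a base R sketch of that route. It is an alternative, not the approach shown above: it relies on the PCRE verbs (*SKIP)(*F) to step over whole comments so that only the commas outside them act as separators. The names parts and has_def are made up for this example, and it assumes every /* has a matching */ with at most one comment per element:

```r
params <- "var1 /* first, variable */, var2, var3 /* third, variable */"

# Match a whole comment and discard it with (*SKIP)(*F), then split on the
# commas that remain. These are PCRE verbs, so perl = TRUE is required.
parts <- strsplit(params, "/\\*.*?\\*/(*SKIP)(*F)|\\s*,\\s*", perl = TRUE)[[1]]

# Name: everything before an optional comment.
params_clean <- sub("\\s*/\\*.*$", "", parts, perl = TRUE)

# Definition: the trimmed comment body if there is one, otherwise "".
has_def <- grepl("/*", parts, fixed = TRUE)
params_def <- ifelse(has_def,
                     trimws(sub("^.*?/\\*(.*)\\*/\\s*$", "\\1", parts, perl = TRUE)),
                     "")

print(params_clean)
print(params_def)
```

This trades the single str_match_all call for three simpler patterns and avoids the stringr dependency.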