How can I split a string and ignore the delimiter if it's "quoted"?

Say I have the following string:

params <- "var1 /* first, variable */, var2, var3 /* third, variable */"

I want to split it using , as a separator, then extract the "quoted substrings", so that I get 2 vectors as follows:

params_clean <- c("var1","var2","var3")
params_def   <- c("first, variable","","third, variable") # note the empty string as a second element.

I use the term "quoted" in a wide sense, with arbitrary strings, here /* and */, which protect substrings from being split.

I found a workaround based on read.table and the fact that it allows quoted elements:

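library(magrittr)  # provides %>%

params %>%
  gsub("/\\*", "_temp_sep_ '", .) %>%   # turn /* into a marker plus an opening quote
  gsub("\\*/", "'", .) %>%              # turn */ into a closing quote
  read.table(text = ., stringsAsFactors = FALSE, sep = ",") %>%
  unlist %>%
  unname %>%
  strsplit("_temp_sep_") %>%
  lapply(trimws) %>%
  lapply(`length<-`, 2)                 # pad each element to length 2 (name, definition)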

But it's quite ugly and hackish. What's a simpler way? I'm thinking there must be a regex to feed to strsplit for this situation.

related to this question


  • You may use

    library(stringr)  # for str_match_all()

    cmnt_rx <- "(\\w+)\\s*(/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/)?"
    res <- str_match_all(params, cmnt_rx)
    params_clean <- res[[1]][,2]
    ## => [1] "var1" "var2" "var3"
    params_def <- gsub("^/[*]\\s*|\\s*[*]/$", "", res[[1]][,3])
    params_def[is.na(params_def)] <- ""
    ## => [1] "first, variable" ""                "third, variable"

    The main regex details (the overall pattern is (\w+)\s*(COMMENTS_REGEX)?):

    • (\w+) - Capturing group 1: one or more word chars
    • \s* - 0+ whitespace chars
    • ( - Capturing group 2 start
    • /\* - match the comment start /*
    • [^*]*\*+ - match 0+ characters other than * followed with 1+ literal *
    • (?:[^/*][^*]*\*+)* - 0+ sequences of:
      • [^/*][^*]*\*+ - not a / or * (matched with [^/*]) followed with 0+ non-asterisk characters ([^*]*) followed with 1+ asterisks (\*+)
    • / - closing /
    • )? - Capturing group 2 end, repeat 1 or 0 times (it means it is optional).
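
To see the comment sub-pattern from the list above in isolation, here is a minimal base R sketch. The test string x and the variable names are made up for illustration, and perl = TRUE is used so that the non-capturing group is handled by PCRE.

```r
# Comment sub-pattern only (the inside of capturing group 2 of the full regex)
comment_rx <- "/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/"

# Hypothetical input: the first comment contains stray asterisks
x <- "a /* x * y ** z */ b /* second */"

# gregexpr() locates every match; regmatches() extracts the matched text
found <- regmatches(x, gregexpr(comment_rx, x, perl = TRUE))[[1]]
print(found)
```

Each comment is matched as a unit. Asterisks inside a comment do not end it early, because a match can only finish on a run of * that is immediately followed by /.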


    The "^/[*]\\s*|\\s*[*]/$" pattern in gsub removes /* and */ with adjoining spaces.
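
As a quick check of that cleanup step, the sketch below applies the same gsub pattern to values shaped like the third column of the match matrix. The vector raw is typed by hand here, as an assumed stand-in for res[[1]][,3].

```r
# Hand-written stand-in for res[[1]][, 3]; NA marks a parameter with no comment
raw <- c("/* first, variable */", NA, "/* third, variable */")

# First alternative strips a leading "/*" plus spaces,
# second alternative strips spaces plus a trailing "*/"
stripped <- gsub("^/[*]\\s*|\\s*[*]/$", "", raw)
print(stripped)
```

NA entries pass through gsub unchanged, so they still have to be replaced in a separate step.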

    The params_def[is.na(params_def)] <- "" line replaces NA values (parameters without a comment) with empty strings.
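
Since the question asks for a regex to feed to strsplit, here is an alternative sketch in base R. It is not the approach shown above: it relies on the PCRE verbs (*SKIP)(*F), available with perl = TRUE, to step over whole comments so that only commas outside them act as separators. It assumes comments are well formed and not nested; the names split_rx and parts are illustrative.

```r
params <- "var1 /* first, variable */, var2, var3 /* third, variable */"

# A whole comment is matched and then discarded via (*SKIP)(*F);
# otherwise a comma with optional surrounding spaces is the separator
split_rx <- "/\\*.*?\\*/(*SKIP)(*F)|\\s*,\\s*"
parts <- strsplit(params, split_rx, perl = TRUE)[[1]]

# Name: everything before the comment, if there is one
params_clean <- sub("\\s*/\\*.*$", "", parts)

# Definition: text between /* and */, or "" when there is no comment
params_def <- ifelse(
  grepl("/*", parts, fixed = TRUE),
  sub("^.*?/\\*\\s*(.*?)\\s*\\*/\\s*$", "\\1", parts, perl = TRUE),
  ""
)

print(params_clean)
print(params_def)
```

For the params string from the question, this should give the two vectors requested, with an empty string for var2. The split and the extraction are kept as separate steps so that each regex stays short.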