r regex strsplit

How can I split a string and ignore the delimiter if it's "quoted"?


Say I have the following string:

params <- "var1 /* first, variable */, var2, var3 /* third, variable */"

I want to split it using , as the separator, then extract the "quoted" substrings, so that I get two vectors as follows:

params_clean <- c("var1","var2","var3")
params_def   <- c("first, variable","","third, variable") # note the empty string as a second element.

I use the term "quoted" in a wide sense, with arbitrary strings, here /* and */, which protect substrings from being split.

I found a workaround based on read.table and the fact that it allows quoted elements:

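library(magrittr)
params %>%
  gsub("/\\*", "_temp_sep_ '", .) %>%    # turn /* into a marker plus an opening quote
  gsub("\\*/", "'", .) %>%               # turn */ into a closing quote
  read.table(text = ., stringsAsFactors = FALSE, sep = ",") %>%  # quoted commas are not split
  unlist %>%
  unname %>%
  strsplit("_temp_sep_") %>%             # separate each name from its definition
  lapply(trimws) %>%
  lapply(`length<-`, 2) %>%              # pad entries without a definition with NA
  do.call(rbind, .) %>%
  inset(is.na(.), value = "")            # replace NA with ""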

But it's quite ugly and hackish. Is there a simpler way? I suspect there is a regex I could feed to strsplit for this situation.

related to this question


Solution

  • You may use

    library(stringr)
    cmnt_rx <- "(\\w+)\\s*(/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/)?"
    res <- str_match_all(params, cmnt_rx)
    params_clean <- res[[1]][,2]
    params_clean
    ## => [1] "var1" "var2" "var3"
    params_def <- gsub("^/[*]\\s*|\\s*[*]/$", "", res[[1]][,3])
    params_def[is.na(params_def)] <- ""
    params_def
    ## => [1] "first, variable" ""                "third, variable"
    

    The main regex details (it is actually (\w+)\s*(COMMENTS_REGEX)?):

    • (\w+) - Capturing group 1: one or more word chars
    • \s* - 0+ whitespace chars
    • ( - Capturing group 2 start
    • /\* - match the comment start /*
    • [^*]*\*+ - match 0+ characters other than * followed with 1+ literal *
    • (?:[^/*][^*]*\*+)* - 0+ sequences of:
      • [^/*][^*]*\*+ - not a / or * (matched with [^/*]) followed with 0+ non-asterisk characters ([^*]*) followed with 1+ asterisks (\*+)
    • / - closing /
    • )? - Capturing group 2 end, repeat 1 or 0 times (it means it is optional).
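To check the comment part of the pattern in isolation, here is a small base R sketch. The sample strings and the names comment_rx, samples, hit and found are illustrative assumptions, not part of the original answer. perl = TRUE is passed so that the (?:...) group is handled by PCRE.

```r
# The comment sub-pattern on its own (capture group 2 without its parentheses)
comment_rx <- "/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/"

# Hypothetical sample inputs, chosen to exercise each part of the pattern
samples <- c(
  "a /* plain */ b",        # simple comment
  "a /* has * star */ b",   # lone asterisk inside the comment
  "a /* x **/ b",           # several asterisks before the closing slash
  "no comment here"         # nothing to match
)

m <- regexpr(comment_rx, samples, perl = TRUE)
hit <- as.integer(m) > 0              # regexpr returns -1 where nothing matched

found <- rep(NA_character_, length(samples))
found[hit] <- regmatches(samples, m)  # regmatches returns only the actual matches
print(found)
## => [1] "/* plain */"      "/* has * star */" "/* x **/"         NA
```

The second sample is the case the (?:[^/*][^*]*\*+)* part exists for: an asterisk that is not followed by / does not end the comment.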

    See the regex demo.

    The "^/[*]\\s*|\\s*[*]/$" pattern in gsub removes /* and */ with adjoining spaces.
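As a quick illustration of that cleanup pattern on its own, the following sketch applies it to a few hand-written comment strings. The input vector raw and the names strip_rx and stripped are assumptions made for the example.

```r
# Removes a leading "/*" and a trailing "*/", each with adjoining whitespace
strip_rx <- "^/[*]\\s*|\\s*[*]/$"

# Hypothetical inputs: a spaced comment, a tight one, one with an inner "*", and NA
raw <- c("/* first, variable */", "/*third*/", "/* a * b */", NA)

stripped <- gsub(strip_rx, "", raw)
print(stripped)
## => [1] "first, variable" "third"           "a * b"           NA
```

Because both alternatives are anchored (^ and $), an asterisk in the middle of the comment text is left alone, and NA input stays NA.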

    The params_def[is.na(params_def)] <- "" part replaces NA values (parameters that have no comment) with empty strings.
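If stringr is not available, roughly the same result can be obtained with base R only. The sketch below is one possible variant, not the answer's own method: it reuses the same regex to pull out each "name plus optional comment" piece with gregexpr and regmatches, then splits each piece with sub and gsub. The function name split_params is hypothetical.

```r
# A base R sketch of the same idea (split_params is a hypothetical helper name)
split_params <- function(x) {
  rx <- "(\\w+)\\s*(/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/)?"

  # Each match is a parameter name, optionally followed by its comment
  pieces <- regmatches(x, gregexpr(rx, x, perl = TRUE))[[1]]

  # Keep only the leading word of each piece
  clean <- sub("^(\\w+).*$", "\\1", pieces)

  # Drop the leading word, then strip the comment delimiters;
  # pieces without a comment end up as ""
  def <- sub("^\\w+\\s*", "", pieces)
  def <- gsub("^/[*]\\s*|\\s*[*]/$", "", def)

  list(params_clean = clean, params_def = def)
}

params <- "var1 /* first, variable */, var2, var3 /* third, variable */"
res <- split_params(params)
res$params_clean
## => [1] "var1" "var2" "var3"
res$params_def
## => [1] "first, variable" ""                "third, variable"
```

In this variant no NA replacement is needed, because a piece without a comment becomes an empty string once its leading word is removed.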