r, regex, string-parsing

Split a string but ignore separators surrounded by given characters


I would like to split a string, but only at separators that are not enclosed by one of a given set of character pairs.

Current:

strsplit("1 ? 2 ? (3 ? 4) ? {5 ? (6 ? 7)}","\\?")
#> [[1]]
#> [1] "1 "   " 2 "  " (3 " " 4) " " {5 " " (6 " " 7)}"

Expected:

strsplit2 <- function(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE,
                      escape = c("()","{}","[]","''",'""',"%%")){
  # ... 
}
strsplit2("1 ? 2 ? (3 ? 4) ? {5 ? (6 ? 7)}","\\?")
#> [[1]]
#> [1] "1 "   " 2 "  " (3 ? 4) " " {5 ? (6 ? 7)}"

I solved this with some complex parsing, but I worry about the performance and wonder whether a regex would be faster.


FYI:

My current solution (not really relevant to the question) is:

parse_qm_args <- function(x){
  x <- str2lang(x)
  # if single symbol
  if(is.symbol(x)) return(x)
  i <- numeric(0)
  out <- character(0)
  while(identical(x[[c(i,1)]], quote(`?`)) &&
        (!length(i) || length(x[[i]]) == 3)){
    out <- c(x[[c(i,3)]],out)
    i <- c(2, i)
  }
  # if no `?` was found
  if(!length(out)) return(x)

  if(length(x[[i]]) == 2) {
    # if we have a unary `?` fetch its arg
    out <-  c(x[[c(i,2)]],out)
  } else {
    # if we have a binary `?` fetch its first arg
    out <-  c(x[[c(i)]], out)
  }
  out
}

Solution

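  • A recursive regex works well here. The pattern first matches each bracketed or quoted group as a whole and discards it with (*SKIP)(*FAIL), so the split only happens at delimiters outside any group. Recursion and these verbs are PCRE features, hence perl=TRUE: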

    pattern = "([({'](?:[^(){}']*|(?1))*[')}])(*SKIP)(*FAIL)|\\?"
    
    x1 <- "1 ? 2 ? (3 ? 4) ? {5 ? (6 ? 7)}"
    x2 <- "1 ? 2 ? '3 ? 4' ? {5 ? (6 ? 7)}"
    x3 <- "1 ? 2 ? '3 {(? 4' ? {5 ? (6 ? 7)}"
    x4 <- "1 ? 2 ? '(3 ? 4) ? {5 ? (6 ? 7)}'"
    
    strsplit(c(x1,x2,x3, x4),pattern,perl=TRUE)
    
    [[1]]
    [1] "1 "             " 2 "            " (3 ? 4) "      " {5 ? (6 ? 7)}"
    
    [[2]]
    [1] "1 "             " 2 "            " '3 ? 4' "      " {5 ? (6 ? 7)}"
    
    [[3]]
    [1] "1 "             " 2 "            " '3 {(? 4' "    " {5 ? (6 ? 7)}"
    
    [[4]]
    [1] "1 "                         " 2 "                        " '(3 ? 4) ? {5 ? (6 ? 7)}'"
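
  • To connect this back to the strsplit2() signature in the question, here is a minimal sketch that builds the same kind of pattern from the escape pairs, with each piece of the pattern commented. The function name and the escape argument come from the question; the helper esc() and the variable names are hypothetical. The fixed, perl and useBytes arguments are left out, and split is assumed to be a PCRE regex without capture groups or backreferences of its own (they would be renumbered, since group 1 is taken by the wrapper). Like the pattern above, it accepts any closing character for any opener, so mismatched pairs are not detected.

```r
# Sketch only: a wrapper around strsplit() that protects chunks enclosed
# by the given delimiter pairs. Not a drop-in replacement for strsplit().
strsplit2 <- function(x, split,
                      escape = c("()", "{}", "[]", "''", '""', "%%")) {
  # Hypothetical helper: backslash-escape every delimiter so that it is
  # taken literally inside a regex character class.
  esc <- function(ch) paste0("\\", ch, collapse = "")

  openers <- substr(escape, 1, 1)  # first character of each pair
  closers <- substr(escape, 2, 2)  # second character of each pair

  pattern <- paste0(
    "(",                                     # group 1: one whole protected chunk
    "[", esc(openers), "]",                  #   an opening delimiter
    "(?:[^", esc(c(openers, closers)), "]*", #   a run of non-delimiter characters
    "|(?1))*",                               #   or a nested chunk (recurse into group 1)
    "[", esc(closers), "]",                  #   a closing delimiter
    ")(*SKIP)(*FAIL)",                       # discard the chunk and resume after it
    "|(?:", split, ")"                       # otherwise match the separator
  )

  # Recursion and (*SKIP)(*FAIL) need the PCRE engine.
  strsplit(x, pattern, perl = TRUE)
}

strsplit2("1 ? 2 ? (3 ? 4) ? {5 ? (6 ? 7)}", "\\?")
#> [[1]]
#> [1] "1 "             " 2 "            " (3 ? 4) "      " {5 ? (6 ? 7)}"
```

    Escaping every delimiter with a backslash keeps characters such as ] and [ from ending or confusing the character class; PCRE treats a backslash before any non-alphanumeric character as that literal character.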