Splitting a string with nested parentheses at only the top level where "level" is determined by the parentheses


I am trying to create a regular expression that will allow me to split the strings below on the central comma only.

str_1 <- "N(0, 1)"
str_2 <- "N(N(0.1, 1), 1)"
str_3 <- "N(U(0, 1), 1)"
str_4 <- "N(0, T(0, 1))"
str_5 <- "N(N(0, 1), N(0, 1))"

Think of them as parameters of the distributions. Now, I would like to split on the comma of the "top-level".

Some details: The numbers can be decimal and either positive or negative. They will always be grouped inside U(), N(), LN() or T() and separated by a comma. More groupings will be added later, so the solution needs to be general or at least easy to extend.

Now, the first case, str_1, is straightforward using:

unlist(strsplit(str_1, ",", perl = TRUE))

Before I proceed, I need to know whether I have a nesting. I know that I will have more than one of either N, U, LN or T if there is a nesting. So to check, I did (for str_2):

length(attr(gregexpr("(N|LN|U|T)", str_2, perl = TRUE)[[1]], "match.length")) > 1

Having established whether I have a nesting (there might be a cleaner way to test this?), I can proceed to work out the split for the remaining strings. However, this is where I am stuck. I can't simply count the commas, since the cases str_2, str_3 and str_4 would be ambiguous. How can I ensure that I only split on the central comma?

I expect the following outputs (so trimming away the first letter and parenthesis and last parenthesis)

# str_2
"N(0.1, 1)" "1"

# str_3
"U(0, 1)" "1"

# str_4
"0" "T(0, 1)"

# str_5
"N(0, 1)" "N(0, 1)"

I would like to stay with base R, if possible, to reduce the number of dependencies for the code. Any help is much appreciated. It is also possible that this is not solvable by a regex and instead requires a programmatic approach, possibly using recursion, as suggested in this Java question.


Solution

  • If your character vectors are in the format you showed, you can achieve what you need with a single PCRE regex:

    (?:\G(?!^)\s*,\s*|^N\()\K(?:\d+|\w+(\([^()]*(?:(?1)[^()]*)*\)))(?=\s*,|\)$)
    

    See the regex demo. Details:

    • (?:\G(?!^)\s*,\s*|^N\() - either the end of the previous successful match (\G(?!^)) followed by a comma enclosed with zero or more whitespace chars (\s*,\s*), or an N( string at the start of the string (^N\()
    • \K - a match reset operator that discards all text matched so far from the current match memory buffer
    • (?: - start of non-capturing group
      • \d+ - one or more digits
      • | - or
      • \w+ - one or more word chars
      • (\([^()]*(?:(?1)[^()]*)*\)) - Group 1 (needed for recursion to work correctly): a (, then any zero or more chars other than a ( and ), then zero or more occurrences of the Group 1 pattern (recursed) and then zero or more chars other than ( and ) and then a ) char
    • ) - end of the non-capturing group
    • (?=\s*,|\)$) - a positive lookahead: the match must be immediately followed by zero or more whitespace chars and a comma, or by a ) char at the end of the string.
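
    To see what the recursive part does on its own, here is a minimal sketch that uses only the \w+(...) alternative from the pattern above. The variable names balanced and x and the test string are made up for illustration; the point is that (?1) re-enters Group 1, so each match covers one complete call with all of its nested parentheses.

    ```r
    # Only the recursive alternative of the full pattern: a name followed by a
    # balanced (...) block. (?1) recurses into capture group 1.
    balanced <- "\\w+(\\([^()]*(?:(?1)[^()]*)*\\))"

    # Illustrative input (not from the question): two calls, one of them nested.
    x <- "1, T(0, U(1, 2)), N(0, 1)"

    # Each match is a whole call; the inner U(1, 2) is consumed by the outer T(...).
    regmatches(x, gregexpr(balanced, x, perl = TRUE))[[1]]
    # [1] "T(0, U(1, 2))" "N(0, 1)"
    ```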

    See the R demo:

    strs <- c("N(0, 1)", "N(N(0.1, 1), 1)", "N(U(0, 1), 1)", "N(0, T(0, 1))", "N(N(0, 1), N(0, 1))")
    p <- "(?:\\G(?!^)\\s*,\\s*|^N\\()\\K(?:\\d+|\\w+(\\([^()]*(?:(?1)[^()]*)*\\)))(?=\\s*,|\\)$)"
    regmatches(strs, gregexpr(p, strs, perl=TRUE))
    # => [[1]]
    #    [1] "0" "1"
    #    
    #    [[2]]
    #    [1] "N(0.1, 1)" "1"        
    #    
    #    [[3]]
    #    [1] "U(0, 1)" "1"      
    #    
    #    [[4]]
    #    [1] "0"       "T(0, 1)"
    #    
    #    [[5]]
    #    [1] "N(0, 1)" "N(0, 1)"
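
The pattern above is tied to the sample strings: it only starts at N( and only accepts bare integers (\d+) as top-level arguments. If decimals, negative numbers or other outer groupings such as LN() are needed, a programmatic alternative in base R may be easier to extend than the regex. The following is only a sketch under stated assumptions, not part of the regex solution: split_top_level is a hypothetical helper name, and it assumes each input is a single call whose last character is the matching closing parenthesis. It tracks the parenthesis depth of every character and cuts at commas that sit at depth 1.

```r
# Hypothetical helper (illustration only): split the arguments of the outermost
# call at the commas that are directly inside it.
split_top_level <- function(x) {
  chars <- strsplit(x, "")[[1]]
  # Running nesting depth after each character: "(" adds 1, ")" subtracts 1.
  depth <- cumsum((chars == "(") - (chars == ")"))
  first_open <- which(chars == "(")[1]
  last_close <- length(chars)
  # Assumption: one call per string, ending with its balanced closing ")".
  stopifnot(!is.na(first_open), chars[last_close] == ")", depth[last_close] == 0)
  # Commas belonging to the outermost call are the ones at depth 1.
  cuts <- which(chars == "," & depth == 1)
  starts <- c(first_open, cuts) + 1
  ends <- c(cuts, last_close) - 1
  trimws(substring(x, starts, ends))
}

split_top_level("N(N(0.1, 1), 1)")
# [1] "N(0.1, 1)" "1"
split_top_level("LN(-0.5, T(0, 1))")
# [1] "-0.5"    "T(0, 1)"
```

Because the helper never looks at the function names, adding new groupings later should not require any change, and calling it again on a returned piece such as "N(0.1, 1)" splits the next level down.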