I am trying to create a regular expression that will allow me to split the strings below on the central comma only.
str_1 <- "N(0, 1)"
str_2 <- "N(N(0.1, 1), 1)"
str_3 <- "N(U(0, 1), 1)"
str_4 <- "N(0, T(0, 1))"
str_5 <- "N(N(0, 1), N(0, 1))"
Think of them as parameters of distributions. I would like to split each string on its top-level comma only.
Some details: The numbers can be decimal numbers and both positive and negative. They will always be grouped inside U(), N(), LN() or T() and separated by a comma. More groupings will be added later, so the solution needs to be general or at least easily extendable.
Now, the first case, str_1, is straightforward using:
unlist(strsplit(str_1, ",", perl = TRUE))
Before I proceed, I need to know whether there is nesting. I know that I will have more than one of N, U, LN or T if there is nesting. So to check, I did (for str_2):
length(attr(gregexpr("(N|LN|U|T)", str_2, perl = TRUE)[[1]], "match.length")) > 1
Having established whether there is nesting (might there be a cleaner way to test this?), I can proceed to work out the split for the remaining strings. However, this is where I am stuck. I can't simply count the commas, since the cases str_2, str_3 and str_4 would be ambiguous. How would I ensure that I only split on the central comma?
I expect the following outputs (that is, with the outer letter, its opening parenthesis and the last parenthesis trimmed away):
# str_2
"N(0.1, 1)" "1"
# str_3
"U(0, 1)" "1"
# str_4
"0" "T(0, 1)"
# str_5
"N(0, 1)" "N(0, 1)"
I would like to stay with base R to reduce the number of dependencies for the code if possible. Any help is much appreciated. It is also possible that this is not solvable by a regex, but requires a programmatic approach, possibly by recursion as suggested in this Java question.
If your character vectors are in the format you showed, you can achieve what you need with a single PCRE regex:
(?:\G(?!^)\s*,\s*|^N\()\K(?:\d+|\w+(\([^()]*(?:(?1)[^()]*)*\)))(?=\s*,|\)$)
See the regex demo. Details:
- (?:\G(?!^)\s*,\s*|^N\() - end of the previous successful match (\G(?!^)) and then a comma enclosed with zero or more whitespace chars (\s*,\s*), or an N( string at the start of the string (^N\()
- \K - a match reset operator that discards all text matched so far from the current match memory buffer
- (?: - start of a non-capturing group
- \d+ - one or more digits
- | - or
- \w+ - one or more word chars
- (\([^()]*(?:(?1)[^()]*)*\)) - Group 1 (needed for recursion to work correctly): a ( char, then any zero or more chars other than ( and ), then zero or more occurrences of the Group 1 pattern (recursed) each followed by zero or more chars other than ( and ), and then a ) char
- ) - end of the non-capturing group
- (?=\s*,|\)$) - immediately followed by zero or more whitespace chars and then a comma, or by a ) char at the end of string.

In R:
strs <- c("N(0, 1)", "N(N(0.1, 1), 1)", "N(U(0, 1), 1)", "N(0, T(0, 1))", "N(N(0, 1), N(0, 1))")
p <- "(?:\\G(?!^)\\s*,\\s*|^N\\()\\K(?:\\d+|\\w+(\\([^()]*(?:(?1)[^()]*)*\\)))(?=\\s*,|\\)$)"
regmatches(strs, gregexpr(p, strs, perl=TRUE))
# => [[1]]
# [1] "0" "1"
#
# [[2]]
# [1] "N(0.1, 1)" "1"
#
# [[3]]
# [1] "U(0, 1)" "1"
#
# [[4]]
# [1] "0" "T(0, 1)"
#
# [[5]]
# [1] "N(0, 1)" "N(0, 1)"
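
Note that the pattern above hard-codes N( as the outer call, and its \d+ branch only covers unsigned integers, so a bare top-level argument such as -0.5 would most likely not be matched. If that matters, a programmatic alternative in base R is to track parenthesis depth and split on the commas found at depth zero. The following is only a minimal sketch under stated assumptions: split_top_level is a hypothetical helper name (not part of the answer above or of any package), it takes a single string, and it assumes the input is one well-formed call with balanced parentheses.

```r
# Hypothetical helper (an illustration, not the regex approach above):
# split the arguments of the outermost call,
# e.g. "N(U(0, 1), 1)" -> c("U(0, 1)", "1").
# Assumes `x` is a single, well-formed string with balanced parentheses.
split_top_level <- function(x) {
  # Strip the outer name with its "(" and the final ")".
  inner <- sub("^\\s*\\w+\\((.*)\\)\\s*$", "\\1", x)
  chars <- strsplit(inner, "")[[1]]
  # Nesting depth after each character: +1 for "(", -1 for ")".
  depth <- cumsum((chars == "(") - (chars == ")"))
  # Commas at depth 0 separate the top-level arguments.
  cuts <- which(chars == "," & depth == 0)
  starts <- c(1, cuts + 1)
  ends <- c(cuts - 1, length(chars))
  trimws(substring(inner, starts, ends))
}

strs <- c("N(0, 1)", "N(N(0.1, 1), 1)", "N(U(0, 1), 1)",
          "N(0, T(0, 1))", "N(N(0, 1), N(0, 1))")
lapply(strs, split_top_level)
```

Because the sketch only counts parentheses, it does not care which names (U, N, LN, T or later additions) appear, and calling it again on a returned piece such as "N(0.1, 1)" splits the next level down.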