I have this string
(Mozilla/5.0 \(X11; Linux x86_64\) AppleWebKit/537.36 \(KHTML, like Gecko\) Chrome/data Safari/data2) /Producer (Skia/PDF m80) /CreationDate (D:20200420090009+00'00') /ModDate (D:20200420090009+00'00')
I want to get the first occurrence of () where there isn't any \ before the ( or the ). In that case I would get
(Mozilla/5.0 \(X11; Linux x86_64\) AppleWebKit/537.36 \(KHTML, like Gecko\) Chrome/data Safari/data2)
I'm using this regex expression
\([\s\S]*[^\\]{1}\)?
However, I get the whole string.
Your regex can be broken down like so.
[The spaces and newlines are for clarity]
\( match a literal (
[\s\S]* match 0 or more of whitespace or not-whitespace (anything)
[^\\]{1} match 1 thing which is not \
\)? optionally match a literal )
It's that [\s\S]* which winds up slurping in everything. The ? on the end doesn't mean lazy; it makes matching the ) optional. To be lazy, ? must come directly after a variable-length quantifier, as in *?, +?, {3,}?, or {1,5}?.
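To see the difference, here is a minimal Python sketch using the re module. The sample text is a shortened stand-in for the string in the question (an assumption made for brevity), and the second pattern is only the question's regex with the ? moved onto the quantifier, not yet the final answer.

```python
import re

# Shortened stand-in for the string in the question (assumption, for brevity).
text = r"(a \(b\) c) /Producer (Skia/PDF m80)"

# Pattern from the question: [\s\S]* is greedy, and the trailing ?
# only makes the final \) optional.
greedy = re.search(r"\([\s\S]*[^\\]{1}\)?", text)

# Same idea with the ? placed directly after the quantifier,
# which makes [\s\S]*? lazy.
lazy = re.search(r"\([\s\S]*?[^\\]\)", text)

print(greedy.group(0))  # the whole string
print(lazy.group(0))    # only the first parenthesized group
```

The greedy version runs to the end of the input and the optional \) never has to match, which is why the whole string comes back.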
To match just the first set of parentheses, we want to lazily match anything between unescaped parens. Lazily matching anything is easy: .*?. Matching unescaped parens is a little harder. We could match [^\\]\(, but that requires a character to match. This won't work if the opening paren is at the beginning of the string because there's no character before the (. We can solve this by also matching the beginning of the string: (?:[^\\]|^)\(.
(?: non-capturing group
[^\\] match a non \
| or
^ the beginning of the string
)
\( match a literal (
.*? lazy match 0 or more of anything
[^\\] match a non \
\) match a literal )
But this will be foiled by (). It will match all of ()(foo).
(?:[^\\]|^) matches the beginning of the string. \( matches the first (. That leaves .*?[^\\]\) looking at )(foo). The first ) does not match because there is no leading character left for [^\\]; the ( was already consumed. So .*? gobbles up characters until it hits o), which matches [^\\]\).
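The failure described above can be reproduced with a short Python sketch. The variable names are only illustrative; the pattern is the one broken down above, written on a single line.

```python
import re

# The pattern from the breakdown above, on one line.
pattern = re.compile(r"(?:[^\\]|^)\(.*?[^\\]\)")

# An empty pair of parens followed by a second group.
sample = "()(foo)"

match = pattern.search(sample)
print(match.group(0))  # the match runs past the empty () into (foo)
```

Because [^\\] has to consume a real character before the closing paren, the empty () can never be matched on its own.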
The boundary problem is better solved by negative lookbehinds. (?<!\\) says the preceding character must not be a \, which includes there being no character at all. Lookarounds don't consume what they match, so they can be used to peek behind and ahead. Most, but not all, regex engines support them.
(?<!\\) \( match a literal ( which is not after a \
.*? lazy match 0 or more of anything
(?<!\\) \) match a literal ) which is not after a \
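As a rough check, here is the lookbehind pattern written on one line and run in Python against the string from the question. This is a sketch, assuming an engine with lookbehind support (Python's re has it); the variable names are illustrative.

```python
import re

# The string from the question, as a raw string so the backslashes are literal.
text = (
    r"(Mozilla/5.0 \(X11; Linux x86_64\) AppleWebKit/537.36 "
    r"\(KHTML, like Gecko\) Chrome/data Safari/data2) /Producer (Skia/PDF m80) "
    r"/CreationDate (D:20200420090009+00'00') /ModDate (D:20200420090009+00'00')"
)

# Unescaped (, lazily anything, then an unescaped ).
unescaped_parens = re.compile(r"(?<!\\)\(.*?(?<!\\)\)")

first = unescaped_parens.search(text)
print(first.group(0))

# The empty-parens case that foiled the previous pattern.
empty = unescaped_parens.search("()(foo)")
print(empty.group(0))
```

Note that . does not match newlines in Python unless re.DOTALL is passed, so a value spanning several lines would need that flag or [\s\S]*? instead.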
However, there are libraries to parse User-Agents. ua-parser has libraries for many languages.