PCRE Regex - Match everything to the first pipe not enclosed by square brackets


I have the following line of text, where I am trying to extract everything up to the first pipe character that is not enclosed in square brackets.

action=search sourcetype=audittrail [ localop | stats count | eval search_id = replace("$top10_drilldown_sid$", "^remote_[^_]*_", "") | table search_id ] [ localop | stats count | eval earliest = $top10_drilldown_earliest$ - 86400 | table earliest ] latest="$top10_drilldown_latest$" | stats values(savedsearch_name) AS search_name

Expected output:

action=search sourcetype=audittrail [ localop | stats count | eval search_id = replace("$top10_drilldown_sid$", "^remote_[^_]*_", "") | table search_id ] [ localop | stats count | eval earliest = $top10_drilldown_earliest$ - 86400 | table earliest ] latest="$top10_drilldown_latest$"

i.e. everything except the trailing "| stats values(savedsearch_name) AS search_name".

Following some lookaround examples, I could (nearly) get what I needed using a JavaScript regular expression:

/.*\|(?![^\[]*\])/g
But this didn't translate well into a PCRE-compatible expression that worked (plus I want to capture everything up to, but not including, the first pipe).

From what I've read, the nested square brackets in the first bracketed set may be a complication that can't be worked around? There would only be one level of nested brackets in any given set (e.g. [..[]..] or [..[]..[]..])

I admit that I don't think I've got my head fully around positive & negative lookarounds, but any help would be greatly appreciated!


Solution

  • In this kind of situation, it's more efficient to match everything that isn't the delimiter than to try to split on it:

    (?=[^|])[^][|]*(?:(\[[^][]*+(?:(?1)[^][]*)*+])[^][|]*)*
    

    details:

    (?=[^|]) # lookahead: ensure there's at least one non-pipe character at the
             # current position; the goal is to avoid an empty match.
    [^][|]* # all that isn't a bracket or a pipe
    (?:
        (  # open the capture group 1: describe a bracket part
            \[
             [^][]*+ # all that isn't a bracket (note that you don't have to care
                     # about the pipe here, since you are between brackets)
             (?:
                 (?1)  # refer to the capture group 1 subpattern (it's a recursion
                       # since this reference is in the capture group 1 itself)
                 [^][]* 
             )*+
             ]
        ) # close the capture group 1
        [^][|]*
    )*
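
For readers whose regex engine lacks recursion (Python's standard `re` module, for example, has no `(?1)`), the same idea can be sketched with a plain bracket-depth counter instead of a pattern. This is not the pattern above, only an illustration of what it accomplishes: the function name `first_top_level_segment` is hypothetical, and the sketch assumes brackets are balanced (a stray `]` at depth 0 is simply ignored).

```python
def first_top_level_segment(text: str) -> str:
    """Return the text before the first '|' that is not inside [...].

    Illustrative sketch: tracks bracket depth by hand rather than using
    the recursive PCRE pattern. Assumes balanced square brackets.
    """
    depth = 0  # how many '[' are currently open
    for index, char in enumerate(text):
        if char == "[":
            depth += 1
        elif char == "]" and depth > 0:
            depth -= 1
        elif char == "|" and depth == 0:
            # First pipe outside any brackets: stop here, excluding the pipe.
            return text[:index]
    return text  # no top-level pipe: the whole string is one segment


query = "index=main [ search a | head 1 ] | stats count"
print(first_top_level_segment(query).rstrip())
# prints: index=main [ search a | head 1 ]
```

Like the pattern, the sketch keeps the whitespace that precedes the pipe, so the caller strips it if needed.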
    

    If you need empty parts too, you can rewrite it like this:

    (?=[^|])[^][|]*(?:(\[[^][]*+(?:(?1)[^][]*)*+])[^][|]*)*|(?<=\|)
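
If every part is wanted, empty ones included, the depth-counting idea extends to a full split. Again this is only a sketch under the same assumptions (balanced brackets, hypothetical function name `split_top_level`), not a translation of the pattern. One difference worth noting: as far as I can tell, the lookbehind alternative above cannot match at the very start of the string, so it would not report an empty part before a leading pipe, whereas this sketch does.

```python
def split_top_level(text: str, delimiter: str = "|") -> list[str]:
    """Split text on delimiters that are not inside [...], keeping empty parts.

    Illustrative sketch: assumes balanced square brackets.
    """
    parts = []
    depth = 0  # how many '[' are currently open
    start = 0  # index where the current part begins
    for index, char in enumerate(text):
        if char == "[":
            depth += 1
        elif char == "]" and depth > 0:
            depth -= 1
        elif char == delimiter and depth == 0:
            parts.append(text[start:index])
            start = index + 1
    parts.append(text[start:])  # whatever follows the last top-level delimiter
    return parts


print(split_top_level("a||b [c|d] |e"))
# prints: ['a', '', 'b [c|d] ', 'e']
```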