c# regex regex-lookarounds overlapping-matches

How can I Prioritize Overlapping Patterns in RegEx?


I've seen several similar questions, including one I posted myself, but this one is rather specific.

Suppose two alternatives in a single regex pattern can both match the same text in a string. My luck always seems to lean towards the regex engine choosing the wrong one. (I am using the .NET Regex class in C#.)

I have two types of strings that I need to break down:

01 - First Value|02 - Second Value|Blank - Ignore

And:

A - First ValueblankB - Second ValueC - Third Value

My desired result is to map each Code to its Meaning using a single pattern string:

Code,Meaning
01,First Value
02,Second Value
Blank,Ignore
A,First Value
blank,
B,Second Value
C,Third Value

I have tried several patterns but can never quite get it right. The closest I have been able to get is:

(([A-Z0-9]{1,4})[ \-–]{1,3}|([Bb]lank)[ \-–]{0,3})(([A-Z][a-z]+[.,;| ]?)+)

My breakdown:

  • [A-Z0-9]{1,4}[ \-–]{1,3} --> this matches the code: 1 to 4 uppercase letters or digits, followed by 1 to 3 characters that are a space, a hyphen, or the mdash from HTML.

or

  • [Bb]lank[ \-–]{0,3} --> "blank" followed by 0 to 3 characters that are a space, a hyphen, or the mdash from HTML.

then

  • (([A-Z][a-z]+[.,;| ]?)+) --> should match one or more capitalized words, each optionally followed by a space or punctuation character, so "First Value" and "Second Value" should each be matched as a whole.

The initial problem is that the final group matches "Valueblank" in the second input string. I want to prioritize "[Bb]lank" so that it is always matched as part of the first group and NEVER as part of the second group.
I tried putting a (?![Bb]lank) negative lookahead in the final group, but it never seems to work. Any help would be appreciated.

Thanks

Jaeden "Sifo Dyas" al'Raec Ruiner


Solution

  • How about the following (regex101.com example):

    /((?:[A-Z0-9]{1,4}|[Bb]lank)(?=\h[-–]\h)|[Bb]lank)(?:\h[-–]\h|\|)?(.*?)(?=[Bb]lank|\||[A-Z0-9]{1,4}\h[-–]\h|$)/gm
    

    Explanation

    [Bb]lank
    

    All matches for "blank" check for a lower OR uppercase "B"

    ((?:[A-Z0-9]{1,4}|[Bb]lank)(?=\h[-–]\h)|[Bb]lank)
    

    The 1st capturing group: match either an alphanumeric first value or a "blank" first value that is followed by " - " or " – " (positive lookahead), OR a bare "blank" first value whose 2nd group will be empty.

    (?:\h[-–]\h|\|)?
    

    A separator of " - " OR " – " OR "|" which will occur zero or one times.

    (.*?)
    

    Lazily (non-greedily) capture the 2nd group, so it takes as few characters as possible.

    (?=[Bb]lank|\||[A-Z0-9]{1,4}\h[-–]\h|$)
    

    Using a positive lookahead, look for a "blank" OR "|" OR an alphanumeric first value followed by " - " or " – " OR the end of the line (to catch the last item on the row). This marks where the 2nd capture should end.
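
    Since the question targets the .NET Regex class, here is a minimal C# sketch of the same pattern. Two things are my own adjustments rather than part of the regex101 version: as far as I know .NET does not recognize the PCRE escape \h, so it is replaced with [ \t], and the en dash is written as \u2013 to sidestep source-encoding issues. The names CodeMeaningParser and Parse are hypothetical, chosen only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CodeMeaningParser
{
    // Same structure as the pattern above. Assumptions: \h is rewritten as [ \t]
    // and the en dash as \u2013, because this targets .NET rather than PCRE.
    private static readonly Regex Pair = new Regex(
        @"((?:[A-Z0-9]{1,4}|[Bb]lank)(?=[ \t][-\u2013][ \t])|[Bb]lank)" + // group 1: code
        @"(?:[ \t][-\u2013][ \t]|\|)?" +                                   // optional separator
        @"(.*?)" +                                                         // group 2: meaning (lazy)
        @"(?=[Bb]lank|\||[A-Z0-9]{1,4}[ \t][-\u2013][ \t]|$)",             // stop before the next item
        RegexOptions.Multiline);

    // Returns (code, meaning) pairs in the order they appear in the input.
    public static List<KeyValuePair<string, string>> Parse(string input)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Pair.Matches(input))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value, m.Groups[2].Value));
        }
        return result;
    }

    public static void Main()
    {
        string[] inputs =
        {
            "01 - First Value|02 - Second Value|Blank - Ignore",
            "A - First ValueblankB - Second ValueC - Third Value"
        };

        Console.WriteLine("Code,Meaning");
        foreach (string input in inputs)
        {
            foreach (KeyValuePair<string, string> pair in Parse(input))
            {
                Console.WriteLine(pair.Key + "," + pair.Value);
            }
        }
    }
}
```

    On the two sample strings from the question, this should print the Code,Meaning rows listed there, with an empty meaning for the bare "blank". Note that the character classes are case-sensitive on purpose: adding RegexOptions.IgnoreCase would let [A-Z0-9]{1,4} match lowercase letters, and the end-of-item lookahead would then stop in the wrong places.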