Tags: .net, regex, regex-group, negative-lookbehind

Regex to discard an entire capture if it's immediately preceded by a specific character


Given the following text:

somerandomtext06251/750/somerandomtext/21399/10 79/20 8301

how do I extract 06251/750, 79/20, and 8301 while ignoring 21399/10?

The general rules:

  • in an arbitrary string, match every run of at least 2 digits, followed by an optional / and then by at least 2 more digits; be greedy about the digits (take as many as possible)
  • ignore the complete match if it is immediately preceded by /

I started with the following match pattern:

 (?<invnr>\d{2,}/?\d{2,})

It mostly works, with one problem: it also matches 21399/10. So I added a negative lookbehind:

 (?<!/)(?<invnr>\d{2,}/?\d{2,})

Now it skips the first digit of 21399/10 (because that digit is preceded by /), but it still captures the remaining characters, that is, 1399/10. I need to skip 21399/10 entirely.
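The behaviour of the two patterns above can be reproduced with a short script. This is only a sketch, and it assumes Python's `re` module as a stand-in for .NET: the one syntax difference is that Python spells the named group `(?P<invnr>...)`. The variable names are made up for illustration.

```python
import re

text = "somerandomtext06251/750/somerandomtext/21399/10 79/20 8301"

# First attempt: no lookbehind, so the unwanted 21399/10 is matched too.
plain = re.compile(r"(?P<invnr>\d{2,}/?\d{2,})")
plain_matches = [m.group("invnr") for m in plain.finditer(text)]

# Second attempt: (?<!/) only rejects the position right after the slash.
# The engine then retries one character later, where the preceding
# character is the digit 2, so the lookbehind passes and 1399/10 matches.
guarded = re.compile(r"(?<!/)(?P<invnr>\d{2,}/?\d{2,})")
guarded_matches = [m.group("invnr") for m in guarded.finditer(text)]

print(plain_matches)    # ['06251/750', '21399/10', '79/20', '8301']
print(guarded_matches)  # ['06251/750', '1399/10', '79/20', '8301']
```

The second result shows why the lookbehind alone is not enough: a failed lookbehind rejects a single start position, not the whole run of digits.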

How do I make the lookbehind drop the entire match and skip to the next one, instead of skipping just one digit?


Solution

  • Add a digit pattern to the negative lookbehind, combining it with / in a character class, [/\d]. That way a match cannot start immediately after a / or immediately after a digit, so the regex engine can no longer begin a match partway through a run of digits:

    (?<![/\d])\d{2,}(?:/\d{2,})?
    


    Details

    • (?<![/\d]) - a negative lookbehind that fails the match if there is / or a digit immediately to the left of the current location
    • \d{2,} - two or more digits
    • (?:/\d{2,})? - an optional sequence of a / and two or more digits.
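As a quick check of the pattern described above, here is a minimal sketch run against the sample text from the question. It assumes Python's `re` module rather than .NET; the pattern string itself is unchanged, and the variable names are hypothetical.

```python
import re

text = "somerandomtext06251/750/somerandomtext/21399/10 79/20 8301"

# (?<![/\d]) rejects any start position preceded by "/" or by a digit,
# so the engine cannot begin a match partway through 21399/10.
pattern = re.compile(r"(?<![/\d])\d{2,}(?:/\d{2,})?")

# The pattern has no capturing groups, so findall returns whole matches.
matches = pattern.findall(text)

print(matches)  # ['06251/750', '79/20', '8301']
```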

    By default, \d in .NET matches any Unicode decimal digit, not just 0-9. If you need to match ASCII digits only, pass the RegexOptions.ECMAScript option to the .NET regex method, or use [0-9] instead of \d.
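The ASCII-versus-Unicode difference can be illustrated with a small sketch. It assumes Python's `re` module, where `re.ASCII` plays a role comparable to .NET's RegexOptions.ECMAScript for \d; this is an analogy, not the .NET API. The sample string is made up and is written with escapes to avoid rendering problems.

```python
import re

# "\u0664\u0662/\u0667\u0665" is 42/75 written with Arabic-Indic digits.
text = "\u0664\u0662/\u0667\u0665 79/20"

# Default: \d matches any Unicode decimal digit, so both numbers match.
unicode_digits = re.findall(r"(?<![/\d])\d{2,}(?:/\d{2,})?", text)

# re.ASCII restricts \d to [0-9].
ascii_flag = re.findall(r"(?<![/\d])\d{2,}(?:/\d{2,})?", text, flags=re.ASCII)

# An explicit [0-9] class gives the same restriction without any flag.
ascii_class = re.findall(r"(?<![/0-9])[0-9]{2,}(?:/[0-9]{2,})?", text)

print(len(unicode_digits))  # 2
print(ascii_flag)           # ['79/20']
print(ascii_class)          # ['79/20']
```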

    Note that your \d{2,}/?\d{2,} is a bit off: it requires at least two digits on each side of the optional /, so it only matches sequences of four or more digits and never a standalone 2- or 3-digit sequence.
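To make the difference between the two digit patterns concrete, the sketch below tests each one against a few short inputs using whole-string matching. It assumes Python's `re` module; the sample strings and variable names are hypothetical.

```python
import re

original = re.compile(r"\d{2,}/?\d{2,}")    # pattern from the question
revised = re.compile(r"\d{2,}(?:/\d{2,})?")  # pattern from the answer

samples = ["12", "123", "1234", "12/34"]

# fullmatch succeeds only if the whole sample is matched by the pattern.
original_hits = [s for s in samples if original.fullmatch(s)]
revised_hits = [s for s in samples if revised.fullmatch(s)]

print(original_hits)  # ['1234', '12/34']
print(revised_hits)   # ['12', '123', '1234', '12/34']
```

Whether 2- and 3-digit sequences should match depends on the intended rule, so pick the variant that reflects the data you expect.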