Tags: regex, regex-lookarounds, regex-greedy

Regexp to extract "numbers" with optional prefix or at beginning of line


I want to extract "numbers" (package numbers, invoice numbers, etc.) from lines of text. A "number" is just a run of non-whitespace characters (e.g. 123, ABC, Abc, ABC123, ABC-123, X-ABC/123/456), so it is matched by the regexp \S+.

I have lines that can contain "numbers". There are two possible cases:

  1. At the beginning of the line (the first run of non-whitespace characters).
  2. In the middle of the line, but marked with the prefix Number: .

Example lines:

ABC123 bla bla
Number: ABC123 bla bla
Some words 123 Number: ABC123 bla bla

From each of these example lines I want to extract the "number": ABC123.


I know how to write a regexp for the second case (example lines 2 and 3): (?:Number: )(\S+) (a non-capturing group with the prefix Number: followed by a capturing group of non-whitespace characters).
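
To make the second case concrete, here is a minimal Python sketch that runs this regexp over the three example lines. Python, the helper name `extract_prefixed` and the `LINES` list are assumptions made for illustration; the question does not name a language.

```python
import re

# Regexp for the second case only: the "number" must follow "Number: ".
PREFIXED = re.compile(r"(?:Number: )(\S+)")

# The three example lines from the question.
LINES = [
    "ABC123 bla bla",
    "Number: ABC123 bla bla",
    "Some words 123 Number: ABC123 bla bla",
]


def extract_prefixed(line):
    """Return the token after 'Number: ', or None if the prefix is absent."""
    match = PREFIXED.search(line)
    return match.group(1) if match else None


results = [extract_prefixed(line) for line in LINES]
print(results)  # [None, 'ABC123', 'ABC123']
```

The first line yields None because it has no prefix, which is exactly the case that is still missing.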

But what about the first case?

What I tried:

  1. Make the prefix optional: (?:Number: )?(\S+)

I get many matches, but that is not a problem because I can take the first match in each line in my code.

The problem is match 7 (the first match on the third example line): I get the word Some instead of the number ABC123.
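
The behaviour of this first attempt can be reproduced with a short Python sketch. The language and the variable names are assumptions for illustration only.

```python
import re

# Attempt 1 from the question: the prefix is optional.
OPTIONAL_PREFIX = re.compile(r"(?:Number: )?(\S+)")

LINES = [
    "ABC123 bla bla",
    "Number: ABC123 bla bla",
    "Some words 123 Number: ABC123 bla bla",
]

# Every match on the third line: each word becomes a match of its own.
all_on_third = OPTIONAL_PREFIX.findall(LINES[2])

# Taking only the first match per line, as described above.
first_per_line = [OPTIONAL_PREFIX.search(line).group(1) for line in LINES]

print(all_on_third)    # ['Some', 'words', '123', 'ABC123', 'bla', 'bla']
print(first_per_line)  # ['ABC123', 'ABC123', 'Some']
```

Because the prefix is optional, the engine is satisfied with the very first word of the third line and never needs to look for Number: further to the right.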

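  2. Anchor at the start of the line. There are then two alternatives: the start of the line followed by the "number", OR the prefix followed by the "number": (?:^(\S+))|(?:(?:Number: )(\S+)).

But the problem is the same: I get the word Some. And this is even worse, because on the second example line I get Number: as the "number".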

  3. Add a negative lookahead for Number: at the start of the line to eliminate the second problem from the previous step: (?:^(?!Number:)(\S+))|(?:(?:Number: )(\S+)).

But there is still the problem of getting an unrelated word (Some) from the beginning of the line even when the prefix Number: is present and the "number" is in the middle of the line.
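
The two anchored attempts can be compared side by side with a small Python sketch. The names `ANCHORED`, `ANCHORED_LOOKAHEAD` and `first_number` are hypothetical, chosen only for this illustration.

```python
import re

LINES = [
    "ABC123 bla bla",
    "Number: ABC123 bla bla",
    "Some words 123 Number: ABC123 bla bla",
]

# Start of line OR prefix, as two alternatives with one group each.
ANCHORED = re.compile(r"(?:^(\S+))|(?:(?:Number: )(\S+))")

# Same, but the start-of-line branch rejects a line that begins with "Number:".
ANCHORED_LOOKAHEAD = re.compile(r"(?:^(?!Number:)(\S+))|(?:(?:Number: )(\S+))")


def first_number(pattern, line):
    """Return the captured token of the first match, or None."""
    match = pattern.search(line)
    if match is None:
        return None
    # Only one of the two groups takes part in any given match.
    return match.group(1) or match.group(2)


without_lookahead = [first_number(ANCHORED, line) for line in LINES]
with_lookahead = [first_number(ANCHORED_LOOKAHEAD, line) for line in LINES]

print(without_lookahead)  # ['ABC123', 'Number:', 'Some']
print(with_lookahead)     # ['ABC123', 'ABC123', 'Some']
```

The lookahead repairs the second line, but on the third line the start-of-line branch still matches at position 0, before the engine ever reaches the prefix.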


Demo: https://regex101.com/r/G9UFak/1

This question is somewhat similar to: Regex multiple characters but without specific string


Solution

  • You can use

    (?:.*Number:\s*|^)(\S+)
    

    See the regex demo.

    Details

    • (?:.*Number:\s*|^) - either of the two alternatives:
      • .*Number:\s* - zero or more chars other than line break chars, as many as possible (so the last Number: on the line is the one that counts), then Number: and zero or more whitespace chars (to stay on the same line, replace \s with [^\S\r\n], or with \h / [\p{Zs}\t] if supported)
      • | - or
      • ^ - start of a line (with the m option in PCRE-like engines)
    • (\S+) - Group 1: any one or more non-whitespace chars.
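
As a rough check, the pattern can be tried in Python on the example lines from the question. This is a sketch under assumptions: the answer does not name a language, and `extract_number`, `NUMBER` and `NUMBER_MULTILINE` are hypothetical names. The second pattern uses the [^\S\r\n] variant together with re.M so that a whole block of text can be scanned at once.

```python
import re

# Greedy ".*Number:" wins when the prefix exists; otherwise fall back to "^".
NUMBER = re.compile(r"(?:.*Number:\s*|^)(\S+)")

# Variant for multi-line text: "^" matches at each line start (re.M) and the
# whitespace after the prefix is not allowed to cross a line break.
NUMBER_MULTILINE = re.compile(r"(?:.*Number:[^\S\r\n]*|^)(\S+)", re.M)

LINES = [
    "ABC123 bla bla",
    "Number: ABC123 bla bla",
    "Some words 123 Number: ABC123 bla bla",
]


def extract_number(line):
    """Return the "number" of a single line, or None if nothing matches."""
    match = NUMBER.search(line)
    return match.group(1) if match else None


per_line = [extract_number(line) for line in LINES]
from_text = NUMBER_MULTILINE.findall("\n".join(LINES))

print(per_line)   # ['ABC123', 'ABC123', 'ABC123']
print(from_text)  # ['ABC123', 'ABC123', 'ABC123']
```

The alternation order matters: the prefix branch is tried first at the start of the line, so the plain first word is only used when no Number: exists on that line. If a line contains the prefix more than once, the greedy .* makes the last occurrence win.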