
How to treat underscore as white space when extracting a document number


Words in an invoice are sometimes delimited by the underscore character (_) in addition to, or instead of, white space:

...
Some nr_11687767_ other 101308591
Invoice Nr.
M230714_some text
Kirjeldus
...

Sometimes the number is terminated by a newline:

...
This nr_11687767_KMKR_EE101308591
Invoice Nr.
M230714
01.05.2023
Item
...

or by another white space delimiter:

...
Some  nr_11687767_ Text
Invoice Nr M230714   Date 01.05.2023
Desc
...

I tried to extract the number using a regex:

  Regex.Match(tekst, @"(?si).*_?Invoice[\s_]?NR[\s_:\.]?(?<arvenumber>.*?)[\s_]");

Success is true, but the arvenumber group is empty.

How can I get only the number M230714 in the arvenumber group?

Using C# and ASP.NET 7.


Solution

  • I suggest a pattern like this:

    (?i)Invoice\s+Nr\.?[\s_]+(?<arvenumber>[\p{L}0-9]+)
    

    where

    (?i)                        - Ignore case when matching
    Invoice                     - "Invoice"
    \s+                         - One or more whitespace characters
    Nr\.?                       - "Nr" with an optional "."
    [\s_]+                      - One or more whitespace characters or "_"
    (?<arvenumber>[\p{L}0-9]+)  - arvenumber, consisting of letters and/or digits
    

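
As a quick check, the suggested pattern can be run against the three layouts from the question. This is a minimal sketch written as a C# top-level program; the helper name `ExtractInvoiceNumber` and the `samples` array are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (name chosen for illustration): returns the captured
// invoice number, or null when no "Invoice Nr" label is found in the text.
static string? ExtractInvoiceNumber(string text)
{
    Match m = Regex.Match(
        text,
        @"(?i)Invoice\s+Nr\.?[\s_]+(?<arvenumber>[\p{L}0-9]+)");
    return m.Success ? m.Groups["arvenumber"].Value : null;
}

// The three layouts shown in the question.
string[] samples =
{
    "Some nr_11687767_ other 101308591\nInvoice Nr.\nM230714_some text\nKirjeldus",
    "This nr_11687767_KMKR_EE101308591\nInvoice Nr.\nM230714\n01.05.2023\nItem",
    "Some  nr_11687767_ Text\nInvoice Nr M230714   Date 01.05.2023\nDesc",
};

foreach (string sample in samples)
{
    // Prints M230714 for each sample.
    Console.WriteLine(ExtractInvoiceNumber(sample));
}
```

The capture stops at the first character that is neither a letter nor a digit, so a trailing underscore, space, or newline all end the number. Note that the pattern still requires white space between "Invoice" and "Nr"; if an underscore can appear there too, replacing that `\s+` with `[\s_]+` is one possible adjustment, though the question's samples do not show that case.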