python regex expression regex-lookarounds

Use Regex in two different cases


I am using Python 3.7 to recognize patterns in PDF documents by extracting elements with regular expressions. The extracted data comes in one of two forms.

In the first case, the result looks like this:

R.U.T .: 99.999.999-9
COMPANY
ELECTRONIC TICKET
Committed to you
N
54280631
COMPANY S.A. SALE
RUT: 99.999.999-9 Directory 111, City

And the second case is:

R.U.T .: 99.999.999-9
COMPANY
ELECTRONIC TICKET
Committed to you
N 54280631
COMPANY S.A. SALE
RUT: 99.999.999-9 Directory 111, City

I need a single regular expression that handles both cases and extracts the invoice number. In this example the invoice number is "N 54280631".

I have tried the following regex, but it works for only one of the two cases.

([N]).*\n+([0-9])+.*\w+

Any idea what the regex should look like to get that result in both cases?


Solution

  • You can use

    (?m)^N\s+(\d+)$
    

    Details:

    • (?m) - an inline re.M modifier that makes ^ and $ match start/end of line positions
    • ^ - start of a line
    • N - N char
    • \s+ - one or more whitespace characters, including line breaks, so the digits may follow N on the same line or on the next one
    • (\d+) - Group 1: one or more digits
    • $ - end of line.
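
To make the role of \s+ concrete, here is a minimal sketch that runs the pattern above and the pattern from the question against two reduced inputs. The variable names (pattern, attempt, same_line, split_line) are illustrative and not part of the original post; the sample strings are shortened versions of the two layouts in the question.

```python
import re

# Pattern from the answer: N at the start of a line, whitespace, digits, end of line.
pattern = re.compile(r'^N\s+(\d+)$', re.M)

# Pattern from the question, copied verbatim for comparison.
attempt = re.compile(r'([N]).*\n+([0-9])+.*\w+')

same_line = 'N 54280631'     # second layout: number on the same line as N
split_line = 'N\n54280631'   # first layout: number on the line after N

for sample in (same_line, split_line):
    new = pattern.search(sample)
    old = attempt.search(sample)
    print(repr(sample),
          '-> answer pattern:', new.group(1) if new else None,
          '| question pattern matched:', bool(old))
```

The answer's pattern captures 54280631 in both layouts because \s matches a space as well as a newline. The pattern from the question requires at least one \n between N and the digits, so it only matches the layout where the number sits on the next line.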


    In Python, you can use either re.findall to get all match occurrences, or re.search to get the first match only:

    import re
    text = 'Your_text_here'
    pattern = r'^N\s+(\d+)$'
    # First match:
    m = re.search(pattern, text, re.M)
    if m:
        print(m.group(1))
    
    # Get all occurrences:
    print(re.findall(pattern, text, re.M))
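
As a worked example, the snippet below applies the same pattern to the two full samples from the question. The helper name extract_invoice_number and the constant INVOICE_RE are hypothetical, introduced only for illustration, and the sketch assumes the text extracted from the PDF uses plain \n line endings with no trailing spaces.

```python
import re

# Same pattern as above, compiled once with the multiline flag.
INVOICE_RE = re.compile(r'^N\s+(\d+)$', re.M)


def extract_invoice_number(text):
    """Return the digits that follow a line-initial N, or None if not found."""
    m = INVOICE_RE.search(text)
    return m.group(1) if m else None


# First layout from the question: the number is on the line after N.
case_1 = """R.U.T .: 99.999.999-9
COMPANY
ELECTRONIC TICKET
Committed to you
N
54280631
COMPANY S.A. SALE
RUT: 99.999.999-9 Directory 111, City"""

# Second layout from the question: the number is on the same line as N.
case_2 = """R.U.T .: 99.999.999-9
COMPANY
ELECTRONIC TICKET
Committed to you
N 54280631
COMPANY S.A. SALE
RUT: 99.999.999-9 Directory 111, City"""

print(extract_invoice_number(case_1))
print(extract_invoice_number(case_2))
```

Both calls print 54280631. If the extractor leaves \r\n line endings or trailing spaces after the digits, the $ anchor no longer matches at that position, so it may be worth normalizing the text first, for example with text.replace('\r\n', '\n').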