python regex expression regex-lookarounds

Use Regex in two different cases

I am using Python 3.7 to recognize patterns in PDF documents by extracting elements with regular expressions. The extracted text comes out in one of two layouts.

In the first case, the result looks like this:

R.U.T .: 99.999.999-9
COMPANY
ELECTRONIC TICKET
Committed to you
N
54280631
COMPANY S.A. SALE
RUT: 99.999.999-9 Directory 111, City

And the second case is:

R.U.T .: 99.999.999-9
COMPANY
ELECTRONIC TICKET
Committed to you
N 54280631
COMPANY S.A. SALE
RUT: 99.999.999-9 Directory 111, City

I need a single regular expression that handles both cases and extracts the invoice number. In this example the invoice number is "N 54280631".

I have tried the following regex, but it does not work for one of the two cases:

([N]).*\n+([0-9])+.*\w+

Any idea what the regex should look like to get that result?

Solution

You can use

(?m)^N\s+(\d+)$

Details:

(?m) - an inline re.M modifier that makes ^ and $ match start/end of line positions
^ - start of a line
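N - the letter N
\s+ - one or more whitespace characters; since \s also matches line breaks, this covers both "N 54280631" on one line and "N" followed by the number on the next line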
(\d+) - Group 1: one or more digits
$ - end of line.


In Python, you can use either re.findall to get all match occurrences, or re.search to get the first match only:

import re
text = 'Your_text_here'
pattern = r'^N\s+(\d+)$'
# First match:
m = re.search(pattern, text, re.M)
if m:
    print(m.group(1))

# Get all occurrences:
print(re.findall(pattern, text, re.M))
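
To check the pattern against both layouts from the question, here is a minimal sketch. The sample strings are copied from the question; the helper name extract_invoice_number and the variable names are only illustrative assumptions, not part of any library.

```python
import re

# Pattern from the answer: "N" at the start of a line, whitespace
# (a space or a line break), then digits up to the end of the line.
INVOICE_RE = re.compile(r'^N\s+(\d+)$', re.M)

# First layout from the question: "N" and the number on separate lines.
case_newline = """R.U.T .: 99.999.999-9
COMPANY
ELECTRONIC TICKET
Committed to you
N
54280631
COMPANY S.A. SALE
RUT: 99.999.999-9 Directory 111, City"""

# Second layout from the question: "N" and the number on the same line.
case_space = """R.U.T .: 99.999.999-9
COMPANY
ELECTRONIC TICKET
Committed to you
N 54280631
COMPANY S.A. SALE
RUT: 99.999.999-9 Directory 111, City"""


def extract_invoice_number(text):
    """Hypothetical helper: return the digits after "N", or None if not found."""
    m = INVOICE_RE.search(text)
    return m.group(1) if m else None


for sample in (case_newline, case_space):
    print(extract_invoice_number(sample))
```

Both samples should yield the same digits, because \s+ consumes either the space or the line break between N and the number.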