I am using Python 3.7 to recognize patterns in PDF documents by extracting elements with regular expressions. The extracted text comes out in one of two forms.
In the first case, the result looks like this:
R.U.T .: 99.999.999-9
COMPANY
ELECTRONIC TICKET
Committed to you
N
54280631
COMPANY S.A. SALE
RUT: 99.999.999-9 Directory 111, City
And the second case is:
R.U.T .: 99.999.999-9
COMPANY
ELECTRONIC TICKET
Committed to you
N 54280631
COMPANY S.A. SALE
RUT: 99.999.999-9 Directory 111, City
I need a single regular expression that handles both cases and extracts the invoice number. In this example the invoice number is "N 54280631".
I have tried the following regex, but it does not work for one of the two cases:
([N]).*\n+([0-9])+.*\w+
Any idea what the regex should look like to get that result in both cases?
You can use
(?m)^N\s+(\d+)$
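Details:
- (?m) - an inline re.M modifier that makes ^ and $ match start/end of line positions
- ^ - start of a line
- N - an N char
- \s+ - one or more whitespace chars (this includes line breaks, so it covers both the "N 54280631" and the "N", newline, "54280631" layouts)
- (\d+) - Group 1: one or more digits
- $ - end of a line.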
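As a quick check of the points above, here is a minimal sketch showing that the inline (?m) and the re.M argument behave the same, and that \s+ spans either a space or a line break. The two sample strings are shortened, hypothetical stand-ins for the two layouts in the question.

```python
import re

# Hypothetical shortened inputs standing in for the two extraction layouts.
split_lines = "Committed to you\nN\n54280631\nCOMPANY S.A. SALE"
same_line = "Committed to you\nN 54280631\nCOMPANY S.A. SALE"

inline = re.compile(r"(?m)^N\s+(\d+)$")     # flag written inside the pattern
flagged = re.compile(r"^N\s+(\d+)$", re.M)  # same flag passed as an argument

for sample in (split_lines, same_line):
    # \s+ consumes either the newline or the space after "N".
    assert inline.search(sample).group(1) == "54280631"
    assert flagged.search(sample).group(1) == "54280631"

# Without multiline mode, ^ only matches at the very start of the text,
# so the "N" line in the middle of the text is not found.
assert re.search(r"^N\s+(\d+)$", same_line) is None
```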
In Python, you can use either re.findall to get all match occurrences, or re.search to get the first match only:
import re

text = 'Your_text_here'
pattern = r'^N\s+(\d+)$'

# First match only (Group 1 holds the digits):
m = re.search(pattern, text, re.M)
if m:
    print(m.group(1))

# All occurrences (a list of the Group 1 values):
print(re.findall(pattern, text, re.M))
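To try the pattern against both layouts from the question, something like the sketch below could be used. The helper name extract_invoice_number is hypothetical, and the optional [ \t]* before $ is an added assumption that the PDF extraction may leave trailing spaces on the line; drop it if your text is clean.

```python
import re

# Pattern from the answer, plus optional trailing spaces/tabs before the line
# end (an assumption about what PDF extraction may leave behind).
INVOICE_RE = re.compile(r"^N\s+(\d+)[ \t]*$", re.M)


def extract_invoice_number(text):
    """Return the digits that follow a line-initial 'N', or None if absent."""
    m = INVOICE_RE.search(text)
    return m.group(1) if m else None


# Case 1: "N" and the number are on separate lines.
case_1 = "\n".join([
    "R.U.T .: 99.999.999-9",
    "COMPANY",
    "ELECTRONIC TICKET",
    "Committed to you",
    "N",
    "54280631",
    "COMPANY S.A. SALE",
    "RUT: 99.999.999-9 Directory 111, City",
])

# Case 2: "N" and the number share one line.
case_2 = case_1.replace("N\n54280631", "N 54280631")

print(extract_invoice_number(case_1))
print(extract_invoice_number(case_2))
```

Both calls should return the same digits. If you need the leading "N" in the result as well, you can prepend it yourself, since only the digits are captured in Group 1.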