I am trying to write a PDF reader script. When I apply my pattern with re, it returns nothing.
Input:
import requests
import pdfplumber
import pandas as pd
import re
with pdfplumber.open("file.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
decl = re.compile(r'\d{8}AN\d{6}')
for line in text.split('\n'):
    if decl.search(line):
        print(line)
The line being searched in the PDF file is 'CHEMISCHE FABRIK BUDENHEIM KG PO BOX 61245366AN206589'
But it does not return the required output: 61245366AN206589
I found out that it reads the whole line as a string. How can I work around this?
for char in text.split('\n')[3]:
    print(char)
    print(type(char))
. . . <class 'str'> B <class 'str'> O <class 'str'> X <class 'str'>
<class 'str'> 6 <class 'str'> 1 <class 'str'> 2 <class 'str'> 4 <class 'str'> 5 <class 'str'> 3 <class 'str'> 6 <class 'str'> 6 <class 'str'> A <class 'str'> N <class 'str'> 2 <class 'str'> 0 <class 'str'> 6 <class 'str'> 5 <class 'str'> 8 <class 'str'> 9 <class 'str'>
search returns a match object (or None when nothing matches), so if there is a match, you need to extract the result from it.
Here it is in an interactive session:
>>> decl = re.compile(r"\d{8}AN\d{6}")
>>> m = decl.search("CHEMISCHE FABRIK BUDENHEIM KG PO BOX 61245366AN206589")
>>> m
<re.Match object; span=(37, 53), match='61245366AN206589'>
>>> m.group(0)
'61245366AN206589'
>>> m.span()
(37, 53)
The span is the location of the match in the search text, using slice notation values.
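Applied to the loop from the question, this could look roughly like the sketch below. The helper name extract_declarations and the sample text are made up for illustration (the sample stands in for page.extract_text(); only its second line comes from the question), so adapt them to the real PDF text.

```python
import re

# Pattern from the question: 8 digits, the literal "AN", then 6 digits.
decl = re.compile(r"\d{8}AN\d{6}")


def extract_declarations(text):
    """Return the matched substring from every line that contains the pattern."""
    found = []
    for line in text.split("\n"):
        m = decl.search(line)
        if m:  # search() returns None when the line has no match
            found.append(m.group(0))  # the matched text, not the whole line
    return found


# Hypothetical stand-in for page.extract_text().
sample = "SOME HEADER LINE\nCHEMISCHE FABRIK BUDENHEIM KG PO BOX 61245366AN206589"
print(extract_declarations(sample))

# The span values can be used directly as slice bounds on the searched line.
line = sample.split("\n")[1]
start, end = decl.search(line).span()
print(line[start:end])
```

Printing m.group(0) instead of line is the only change needed in the original loop; the slice at the end gives the same substring as group(0).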