python, pdf, python-re

How can I find a pattern when regex reads digits as string type?


I am trying to write a PDF reader script. When I search for the pattern with re, it returns nothing.

Input:

import requests
import pdfplumber
import pandas as pd
import re

with pdfplumber.open("file.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text()

decl = re.compile(r'\d{8}AN\d{6}')

for line in text.split('\n'):
    if decl.search(line):
        print(line)

The line I am searching for in the PDF file is 'CHEMISCHE FABRIK BUDENHEIM KG PO BOX 61245366AN206589'

But it does not return the required output of: 61245366AN206589

I found out that it reads the whole line as a string, digits included. How can I work around this?

for char in text.split('\n')[3]:
    print(char)
    print(type(char))

. . . <class 'str'> B <class 'str'> O <class 'str'> X <class 'str'>

<class 'str'> 6 <class 'str'> 1 <class 'str'> 2 <class 'str'> 4 <class 'str'> 5 <class 'str'> 3 <class 'str'> 6 <class 'str'> 6 <class 'str'> A <class 'str'> N <class 'str'> 2 <class 'str'> 0 <class 'str'> 6 <class 'str'> 5 <class 'str'> 8 <class 'str'> 9 <class 'str'>


Solution

  • search returns a match object (or None when nothing matches), so if there is a match, you need to extract the matched text from that object, for example with group(0).

    Here it is in an interactive session:

    >>> decl = re.compile(r"\d{8}AN\d{6}")
    >>> m = decl.search("CHEMISCHE FABRIK BUDENHEIM KG PO BOX 61245366AN206589")
    >>> m
    <re.Match object; span=(37, 53), match='61245366AN206589'>
    >>> m.group(0)
    '61245366AN206589'
    >>> m.span()
    (37, 53)
    

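    The span gives the start and end indices of the match within the searched text. They work directly as slice bounds, so slicing the searched string with [37:53] yields the matched substring.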
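
Applied to the loop from the question, the same idea might look like the sketch below. It assumes the extracted text contains the line exactly as quoted; the `text` variable here is a hard-coded stand-in for `page.extract_text()`, its first line is made up for illustration, and `find_declarations` is a hypothetical helper name, not part of `re` or pdfplumber.

```python
import re

# Pattern from the question: 8 digits, the literal "AN", then 6 digits.
# \d matches digit *characters*, so the text being a str is not a problem.
decl = re.compile(r"\d{8}AN\d{6}")

# Stand-in for page.extract_text(); the first line is a made-up placeholder,
# the second is the line quoted in the question.
text = "\n".join([
    "Some header line",
    "CHEMISCHE FABRIK BUDENHEIM KG PO BOX 61245366AN206589",
])


def find_declarations(text):
    """Return the matched substring from every line that contains one."""
    found = []
    for line in text.split("\n"):
        m = decl.search(line)  # a Match object, or None if nothing matched
        if m:
            # group(0) is the matched text itself, not the whole line.
            found.append(m.group(0))
            # The span gives the same substring via slicing.
            start, end = m.span()
            assert line[start:end] == m.group(0)
    return found


declarations = find_declarations(text)
print(declarations)
```

Printing `m.group(0)` instead of `line` is the only change needed in the original loop. If the loop still finds nothing on the real PDF, the extracted text probably differs from what is displayed (for instance, extra whitespace inside the number), which is worth checking with `repr(line)`.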