python pdf pypdf pdf-reader

Get Text with a PDF Reader?

How can I extract only this short code when I read a PDF?

CLSAI10608

The code always starts with CL and is 10 characters long (CLXXXXXXXX).

Code:

import PyPDF2

# Open the PDF in binary mode; the with block closes the file afterwards
with open('document.pdf', 'rb') as file:
    pdfreader = PyPDF2.PdfFileReader(file)
    pageobj = pdfreader.getPage(0)  # first page
    print(pageobj.extractText())

output:

Solution

The regex pattern I came up with searches for something starting with CL followed by 8 non-whitespace characters. regex101.com provides a handy explanation of the pattern.

import re

string = r"""Detalle

Total

4040CL02

  Correccion de BL

CLSAI10608LV-PASSERO V0008-MBL : ISGA0F000

47.020"""

# "CL" followed by exactly 8 non-whitespace characters
match = re.search(r"CL\S{8}", string)
if match:
    code = match.group()
    print(code)

Output: CLSAI10608

To apply this to your PDF, pass pageobj.extractText() in place of string.
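Putting the two pieces together might look like the sketch below. The helper names find_code and find_code_in_pdf are hypothetical, not part of PyPDF2, and the PDF part assumes the same older PyPDF2 API used in the question (PdfFileReader, getPage, extractText). Newer PyPDF2/pypdf releases rename these to PdfReader, reader.pages and extract_text(), so adjust to your installed version.

```python
import re

# "CL" followed by 8 non-whitespace characters, same idea as the pattern above.
CODE_PATTERN = re.compile(r"CL\S{8}")


def find_code(text):
    """Return the first CL code found in text, or None if there is none.

    Hypothetical helper for illustration only.
    """
    match = CODE_PATTERN.search(text)
    return match.group() if match else None


def find_code_in_pdf(path, page_number=0):
    """Extract one page's text with PyPDF2 and search it for the code.

    Hypothetical helper; assumes the older PyPDF2 API from the question.
    """
    import PyPDF2  # imported here so find_code works without PyPDF2 installed

    with open(path, "rb") as file:
        reader = PyPDF2.PdfFileReader(file)
        text = reader.getPage(page_number).extractText()
    return find_code(text)


# Shortened version of the extracted text from the example above.
sample = "4040CL02\n\nCLSAI10608LV-PASSERO V0008-MBL : ISGA0F000"
print(find_code(sample))
```

With a real file you would call find_code_in_pdf('document.pdf') instead. Note that "4040CL02" is skipped because fewer than 8 non-whitespace characters follow its "CL".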