How can I get only this simple text when I read a pdf?
CLSAI10608
This code always starts with CL and is 10 characters long in total (CLXXXXXXXX).
Code:
import PyPDF2
file = open('document.pdf', 'rb')
pdfreader = PyPDF2.PdfFileReader(file)
pageobj = pdfreader.getPage(0)
print(pageobj.extractText())
output:
The regex pattern I came up with searches for something starting with CL
followed by 8 non-whitespace characters. regex101.com provides a handy explanation of the pattern.
import re
string = r"""Detalle
Total
4040CL02
Correccion de BL
CLSAI10608LV-PASSERO V0008-MBL : ISGA0F000
47.020"""
match = re.search(r"[C][L]\S{8}", string)
if match:
    code = match.group()
    print(code)
Output: CLSAI10608
So you'd want to replace string with pageobj.extractText().
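If you want something reusable, here is a minimal sketch that wraps the search in a function. The name find_cl_code and the tighter pattern are my own assumptions, not something from the question: it only accepts uppercase letters and digits after CL, whereas \S{8} accepts any non-whitespace characters. If your codes can contain other characters, switch back to \S{8}.

```python
import re

# Assumption: a code is "CL" followed by 8 uppercase letters or digits.
# The pattern above used \S{8}, which allows any non-whitespace characters.
CODE_PATTERN = re.compile(r"CL[A-Z0-9]{8}")


def find_cl_code(text):
    """Return the first CL code found in text, or None if there is none."""
    match = CODE_PATTERN.search(text)
    return match.group() if match else None


# Sample text standing in for the extracted page text.
sample = """Detalle
Total
4040CL02
Correccion de BL
CLSAI10608LV-PASSERO V0008-MBL : ISGA0F000
47.020"""

print(find_cl_code(sample))
```

With the PDF from the question you would call find_cl_code(pageobj.extractText()) instead of passing sample. The function returns None when nothing matches, so check for that before using the result.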