Tags: python, pdf, pypdf, pdf-reader

Get Text with a PDF Reader?


How can I extract only this short code when I read a PDF?

CLSAI10608

The code always starts with CL and is 10 characters long (CLXXXXXXXX).

Code:

import PyPDF2
file = open('document.pdf', 'rb')
pdfreader = PyPDF2.PdfFileReader(file)
pageobj = pdfreader.getPage(0)
print(pageobj.extractText())

Output: [screenshot of the extracted page text]


Solution

  • The regex pattern I came up with searches for text starting with CL followed by 8 non-whitespace characters, which gives the 10-character code. regex101.com provides a handy breakdown of the pattern.

    import re
    
    string = r"""Detalle
    
    Total
    
    4040CL02
    
      Correccion de BL
    
    CLSAI10608LV-PASSERO V0008-MBL : ISGA0F000
    
    47.020"""
    
    match = re.search(r"CL\S{8}", string)
    if match:
        code = match.group()
        print(code)
    

    Output: CLSAI10608

    To apply this to the PDF, replace string with pageobj.extractText().
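
Putting the two pieces together might look roughly like the sketch below. The helper names find_code and code_from_pdf are made up for illustration, and the PDF part assumes the same older PyPDF2 calls used in the question (PdfFileReader, getPage, extractText). Newer PyPDF2/pypdf releases renamed these to PdfReader, reader.pages[n] and extract_text(), so adjust to whatever version is installed.

```python
import re

# "CL" followed by 8 non-whitespace characters: 10 characters in total.
CODE_PATTERN = re.compile(r"CL\S{8}")


def find_code(text):
    """Return the first CL code found in text, or None if there is none."""
    match = CODE_PATTERN.search(text)
    return match.group() if match else None


def code_from_pdf(path, page_number=0):
    """Hypothetical helper: extract one page's text and search it for the code."""
    # Imported here so find_code can be used without PyPDF2 installed.
    # Assumes an older PyPDF2 release, as in the question; newer releases
    # use PdfReader, reader.pages[page_number] and extract_text() instead.
    import PyPDF2

    with open(path, "rb") as file:
        reader = PyPDF2.PdfFileReader(file)
        text = reader.getPage(page_number).extractText()
    return find_code(text)


# Quick check against a fragment of the extracted text shown above.
sample = "4040CL02\n\nCLSAI10608LV-PASSERO V0008-MBL : ISGA0F000"
print(find_code(sample))
```

Note that "4040CL02" is skipped because only two non-whitespace characters follow its CL. If the codes are known to contain only uppercase letters and digits, a tighter pattern such as CL[A-Z0-9]{8} would reduce false matches, but that is an assumption about the data rather than something stated in the question.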