
How to extract text from this compressed PDF/A?


For machine learning purposes (scikit-learn), I need to extract the raw text from lots of PDF files. At first, I used pdftotext from xpdf for this task:

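import os
import subprocess

# xpdf_path (xpdf install folder) and pdf (input file path) are defined elsewhere
exe = os.path.join(xpdf_path, "pdftotext.exe")
txt = pdf + ".txt"
# passing a list avoids manual quoting of paths that contain spaces
subprocess.check_output([exe, pdf, txt])
with open(txt) as f:
    texto_converted = f.read()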

Unfortunately, for a few of them I was unable to get the text because they use "stream" in their PDF source, like this one.

The result is something like this:

59!"#$%&'()*+,-.#/#01"21"" 345667.0*(879:4$;<;4=<6>4?$@"12!/ 21#$@A$3A$>@>BCDCEFGCHIJKIJLMNIJILOCNPQRDS QPFTRPUCTCVQWBCTTQXFPYTO"21 "#/!"#(Z[12\&A+],$3^_3;9`Z &a# .2"#.b#"(#c#A(87*95d$d4?$d3e#Z"f#\"#2b?2"#`Z 2"!eb2"#H1TBRgF JhiO
jFK# 2"k#`Z !#212##"elf/e21m#*c!n2!!#/bZ!#2#`Z "eo ]$5<$@;A533> "/\ko/f\#e#e#p

I even tried using zlib + regex:

import re
import zlib

with open("pdfa.pdf", "rb") as f:
    pdf = f.read()
stream = re.compile(b'.*?FlateDecode.*?stream(.*?)endstream', re.S)

for s in stream.findall(pdf):
    s = s.strip(b'\r\n')
    try:
        print(zlib.decompress(s).decode('UTF-8'))
        print("")
    except (zlib.error, UnicodeDecodeError):
        # skip streams that do not inflate or are not UTF-8 text
        pass

The result was something like this:

1 0 -10 -10 10 10 d1
0.01 0 0 0.01 0 0 cm
1 0 -10 -10 10 10 d1
0.01 0 0 0.01 0 0 cm

I even tried pdftopng (xpdf) to run tesseract on the result afterwards, without success. So, is there any way to extract pure text from a PDF like that using Python or a third-party app?


Solution

  • If you want to decompress the streams in a PDF file, I can recommend using qpdf, but on this file

     qpdf --decrypt --stream-data=uncompress document.pdf out.pdf
    

    doesn't help either.
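
A possible reason decompression alone does not help: `d1` in the question's output is the PDF operator that sets glyph metrics inside a Type 3 font glyph procedure, so at least those inflated streams describe glyph shapes rather than page text. The sketch below is only an illustration on synthetic data (the names `inflate_streams` and `has_text_operators` are made up here, and the byte strings are not taken from the document in question). It inflates Flate streams and checks each one for the text-showing operators `Tj`/`TJ`:

```python
import re
import zlib

# Grabs the raw bytes between "stream" and "endstream". This is fragile; a
# real parser would use the /Length entry of the stream dictionary instead.
STREAM_RE = re.compile(rb'stream\r?\n(.*?)\nendstream', re.S)

def inflate_streams(data):
    """Return every stream body in `data` that zlib can inflate."""
    inflated = []
    for raw in STREAM_RE.findall(data):
        try:
            # decompressobj tolerates stray end-of-line bytes after the data
            inflated.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            continue  # not Flate-encoded, or some other filter is used
    return inflated

def has_text_operators(content):
    """Rough check: a string or array operand followed by Tj or TJ."""
    return re.search(rb'[)\]>]\s*(?:Tj|TJ)\b', content) is not None

# Synthetic stand-ins: a glyph procedure and an ordinary text-showing stream.
glyph_proc = b'1 0 -10 -10 10 10 d1\n0.01 0 0 0.01 0 0 cm'
page_text = b'BT /F1 12 Tf (Hello) Tj ET'
fake_pdf = b''.join(
    b'<< /Filter /FlateDecode >>\nstream\n' + zlib.compress(s) + b'\nendstream\n'
    for s in (glyph_proc, page_text)
)

streams = inflate_streams(fake_pdf)
flags = [has_text_operators(s) for s in streams]
print(flags)
```

Even when a stream does contain `Tj`/`TJ`, the operands are character codes in the font's encoding, so without a usable mapping to Unicode they will not necessarily read as text. That is one reason falling back to OCR can be the pragmatic choice.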

    I am not sure, though, why your efforts with xpdf and tesseract did not work out. Using ImageMagick's convert to create PNG files in a temporary directory, and then running tesseract on them, you can do:

    import os
    from pathlib import Path
    from tempfile import TemporaryDirectory
    import subprocess
    
    DPI=600
    
    def call(*args):
        cmd = [str(x) for x in args]
        # print(' '.join(cmd))  # uncomment to get some sense of progress
        return subprocess.check_output(cmd, stderr=subprocess.STDOUT).decode('utf-8')
    
    def ocr(docpath, lang):
        result = []
        abs_path = Path(docpath).expanduser().resolve()
        old_dir = os.getcwd()
        out = Path('out.txt')
        with TemporaryDirectory() as tmpdir:
            os.chdir(tmpdir)
            try:
                # one PNG per page: out-0.png, out-1.png, ...
                call('convert', '-density', DPI, abs_path, 'out.png')
                index = -1
                while True:
                    # names have no leading zeros on the digits, would be difficult to sort glob() output
                    # so just count them
                    index += 1
                    png = Path(f'out-{index}.png')
                    if not png.exists():
                        break
                    # tesseract appends .txt to the output base name
                    call('tesseract', '--dpi', DPI, png, out.stem, '-l', lang)
                    result.append(out.read_text(encoding='utf-8'))
            finally:
                # leave the temporary directory before it gets removed
                os.chdir(old_dir)
        return result
    
    pages = ocr('~/Downloads/document.pdf', 'por')
    print('\n'.join(pages[1].splitlines()[21:24]))
    

    which gives:

    DA NÃO REALIZAÇÃO DE AUDIÊNCIA DE AUTOCOMPOSIÇÃO NO CASO EM CONCRETO
    
    Com vista a obter maior celeridade processual, assim como da impossibilidade de conciliação entre
    

    If you are on Windows, make sure your PDF file is not open in a different process (like a PDF viewer), as Windows doesn't seem to like that.

    The final print is limited as the full output is quite large.

    The conversion and OCR take a while, so you might want to uncomment the print in call() to get some sense of progress.
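
The loop above counts page files because their names are not zero-padded. If you would rather glob for them, a numeric sort key is one alternative. This is a minimal sketch, assuming the `out-N.png` naming used above; `page_number` and `sorted_pages` are hypothetical helper names. A bare `out.png` is treated as page 0, which I believe is what convert writes for a single-page input:

```python
import re
from pathlib import Path
from tempfile import TemporaryDirectory

def page_number(path):
    """Extract the trailing page index from a name such as 'out-12.png'."""
    match = re.search(r'-(\d+)$', path.stem)
    # a name without an index (plain 'out.png') counts as page 0
    return int(match.group(1)) if match else 0

def sorted_pages(directory, pattern='out*.png'):
    """Return the page images in `directory` in numeric page order."""
    return sorted(Path(directory).glob(pattern), key=page_number)

# Demonstration with empty placeholder files instead of real page images
with TemporaryDirectory() as tmpdir:
    for i in (10, 2, 0, 1):
        (Path(tmpdir) / f'out-{i}.png').touch()
    names = [p.name for p in sorted_pages(tmpdir)]

print(names)
```

A plain lexicographic sort would put `out-10.png` before `out-2.png`, which is the ordering problem the comment in the loop refers to.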