python pdf compression text-extraction pdfa

How extract text from this compressed PDF/A?

For machine learning purposes (sckit-learn), I need to extract the raw text from lots of PDF files. First off, I was using xpdf pdftotext to do this task:

exe = r'"'+os.path.join(xpdf_path,"pdftotext.exe")+'"'
cmd = exe+" "+"\""+pdf+"\""+" "+"\""+pdf+".txt"+"\""
subprocess.check_output(cmd)
with open(pdf+".txt") as f:
    texto_converted = f.read()

But unfortunately, for few of them, I was unable to get the text because they are using "stream" on their pdf source, like this one.

The result is something like this:

59!"#$%&'()*+,-.#/#01"21"" 345667.0*(879:4$;<;4=<6>4?$@"12!/ 21#$@A$3A$>@>BCDCEFGCHIJKIJLMNIJILOCNPQRDS QPFTRPUCTCVQWBCTTQXFPYTO"21 "#/!"#(Z[12\&A+],$3^_3;9`Z &a# .2"#.b#"(#c#A(87*95d$d4?$d3e#Z"f#\"#2b?2"#`Z 2"!eb2"#H1TBRgF JhiO
jFK# 2"k#`Z !#212##"elf/e21m#*c!n2!!#/bZ!#2#`Z "eo ]$5<$@;A533> "/\ko/f\#e#e#p

I Even trying using zlib + regex:

import re
import zlib

pdf = open("pdfa.pdf", "rb").read()
stream = re.compile(b'.*?FlateDecode.*?stream(.*?)endstream', re.S)

for s in re.findall(stream,pdf):
    s = s.strip(b'\r\n')
    try:
        print(zlib.decompress(s).decode('UTF-8'))
        print("")
    except:
        pass

The result was something like this:

1 0 -10 -10 10 10 d1
0.01 0 0 0.01 0 0 cm
1 0 -10 -10 10 10 d1
0.01 0 0 0.01 0 0 cm

I even tried pdftopng (xpdf) to try tesseract after, without success So, Is there any way to extract pure text from a PDF like that using Python or a third party app?

Solution

If you want to decompress the streams in a PDF file, I can recommend using qdpf, but on this file

 qpdf --decrypt --stream-data=uncompress document.pdf out.pdf

doesn't help either.

I am not sure though why your efforts with xpdf and tesseract did not work out, using image-magick's convert to create PNG files in a temporary directory and tesseract, you can do:

import os
from pathlib import Path
from tempfile import TemporaryDirectory
import subprocess

DPI=600

def call(*args):
    cmd = [str(x) for x in args]
    return subprocess.check_output(cmd, stderr=subprocess.STDOUT).decode('utf-8')

def ocr(docpath, lang):
    result = []
    abs_path = Path(docpath).expanduser().resolve()
    old_dir = os.getcwd()
    out = Path('out.txt')
    with TemporaryDirectory() as tmpdir:
         os.chdir(tmpdir)
         call('convert', '-density', DPI, abs_path, 'out.png')
         index = -1
         while True:
             # names have no leading zeros on the digits, would be difficult to sort glob() output
             # so just count them
             index += 1
             png = Path(f'out-{index}.png')
             if not png.exists():
                 break
             call('tesseract', '--dpi', DPI, png, out.stem, '-l', lang)
             result.append(out.read_text())
         os.chdir(old_dir)
    return result

pages = ocr('~/Downloads/document.pdf', 'por')
print('\n'.join(pages[1].splitlines()[21:24]))

which gives:

DA NÃO REALIZAÇÃO DE AUDIÊNCIA DE AUTOCOMPOSIÇÃO NO CASO EM CONCRETO

Com vista a obter maior celeridade processual, assim como da impossibilidade de conciliação entre

If you are on Windows, make sure your PDF file is not open in a different process (like a PDF viewer), as Windows doesn't seem to like that.

The final print is limited as the full output is quite large.

This converting and OCR-ing takes a while so you might want to uncomment the print in call() to get some sense of progress.