
What are some alternatives to PyPDF2 for managing PDF files?


While trying to read the daily proceedings of a Parliament, I found that they are split across many PDF documents, which cannot be opened in the browser and must be downloaded individually. My basic idea is to download all the documents and extract the titles of all the decisions taken.

Previous threads suggest using PyPDF2, but it does not work at all in my case. The characters in the PDF are Greek letters, so perhaps the encoding has something to do with it. On top of that, some pictures are added at the end of the document (these are of no interest to me).

Is there any chance PyPDF2 can pull this off, or should I look elsewhere?


Solution

  • If you're just after the text: it seems that PyPDF2 doesn't support CMaps, so you'll get garbage back if you try something like:

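    from PyPDF2 import PdfFileReader  # legacy API, removed in PyPDF2 3.0
    
    with open('document.pdf', 'rb') as fd:
        pdf = PdfFileReader(fd)
        p1 = pdf.getPage(0)        # first page of the document
        print(p1.extractText())    # garbled when the font relies on a CMap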
    

    There's an open pull request to fix this. It has not been merged, but you could pull that code out if you want it, as it looks pretty self-contained.
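
To illustrate why missing CMap support produces garbage, here is a minimal sketch of what a ToUnicode CMap does: it maps the character codes stored in the page content to Unicode text. This is not PyPDF2 code and not the code from the pull request. The names `SAMPLE_CMAP`, `parse_tounicode` and `decode` are hypothetical, the CMap fragment is hand-written rather than taken from a real PDF, and the sketch assumes one-byte character codes and only handles `bfchar` entries (real CMaps can also use `bfrange` and multi-byte codes).

```python
import re

# Hypothetical, hand-written ToUnicode CMap fragment (not from a real PDF).
# Each entry maps a character code to a UTF-16BE encoded Unicode string.
SAMPLE_CMAP = """
5 beginbfchar
<01> <0392>
<02> <03BF>
<03> <03C5>
<04> <03BB>
<05> <03AE>
endbfchar
"""

def parse_tounicode(cmap_text):
    """Build a {character code: Unicode string} table from bfchar entries."""
    table = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            table[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return table

def decode(raw, table):
    """Decode one-byte character codes; unknown codes become U+FFFD."""
    return "".join(table.get(code, "\ufffd") for code in raw)

table = parse_tounicode(SAMPLE_CMAP)
raw = b"\x01\x02\x03\x04\x05"   # bytes as they might appear in a page content stream
print(repr(raw.decode("latin-1")))  # without the CMap: unreadable control characters
print(decode(raw, table))           # with the CMap: readable Greek text
```

An extractor that skips this lookup step has only the raw codes to work with, which would explain the unreadable output for Greek documents.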