Trouble reading some PDFs with PyPDF2


I'm having trouble reading a standard PDF with PyPDF2. The PdfReader class reads the document and gives me the correct metadata properties, but examining any other content gives me the filler text that a browser shows when the Adobe extension is not installed:

The document you are trying to load requires Adobe Reader 8 or higher. You may not have the Adobe Reader installed or your viewing environment may not be properly configured to use Adobe Reader. For information on how to install Adobe Reader and configure your viewing environment please see http://www.adobe.com/go/pdf_forms_configure.

I can successfully read the metadata for this particular PDF, as well as for others published by the same entity with the same tool.

Some sample code to show the issue:

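from pathlib import Path

from PyPDF2 import PdfReader

# Path works on any OS; WindowsPath can only be instantiated on Windows.
award_test = PdfReader(Path("DA Form 638.pdf"))
print(award_test.metadata)                  # document information dictionary
print(award_test.get_form_text_fields())    # text fields of the form, if any
print(award_test.pages[0].extract_text())   # text of the first page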

Yields:

{'/CreationDate': "D:20210517070206-04'00'", '/Creator': 'Designer 6.3', '/Distrubution': 'Unrestricted', '/Doc_Num': '638', '/Form_Month': '04', '/Form_Version': '1.03', '/Form_Year': '2021', '/ModDate': "D:20210517070206-04'00'", '/OMB_Expire': '', '/OMB_Number': '', '/PA_Code': 'No', '/PIN': '083079', '/Pre_Dir': 'AR 600-8-22', '/Prefix': 'DA', '/Producer': 'Designer 6.3', '/Product_Type': 'Form', '/Proponent': 'DCS, G-1', '/Pub_Day': '05', '/Pub_ID': '8-22', '/Pub_Month': '03', '/Pub_Series': '600', '/Pub_Type': 'AR', '/Pub_Year': '2019', '/Scope': 'Army', '/Security_Class': 'UC', '/Signature': 'Yes', '/Subject': 'DA FORM 638, APR 2021', '/Suffix': '', '/Title': 'RECOMMENDATION FOR AWARD', '/Unicode': 'EMO'}
{}
The document you are trying to load requires Adobe Reader 8 or higher. You may not have the Adobe Reader installed or your viewing environment may not be properly configured to use Adobe Reader.   For information on how to install Adobe Reader and configure your viewing environment please see  http://www.adobe.com/go/pdf_forms_configure.

My question is: since I can read other forms published by the same entity with the same tool (per the metadata), is there some way to dig into this one and extract the information? Link to PDF: https://armypubs.army.mil/pub/eforms/DR_a/ARN32485-DA_FORM_638-003-EFILE-4.pdf (This is an unrestricted, unclassified document. I'm simply trying to save time, as I intend to read and write a lot of these en masse.)

I did review a similar question here, "PDFMiner can't read pdf forms that require Adobe Acrobat", but it seemed to be a false lead, as I am using PyPDF2 and I can open other fillable PDFs with this tool.


Solution

  • Your document is a dynamic XFA form. Such forms are defined entirely in XML, and the PDF file only serves as a container. The PDF file itself has a single page containing the message you extracted; that page is a fallback for PDF processors that do not support dynamic XFA forms.

    Open the file with Adobe Reader and you will see the full form with 3 pages. Open the file with SumatraPDF and you will see a single-page PDF showing only the warning you got.

    PyPDF2 may be able to work with XFA forms. If not, you will need a low-level PDF tool to extract the XML streams.
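As a rough illustration of "extracting the XML streams", here is a minimal sketch that walks from the document catalog to the `/AcroForm` dictionary and reads its `/XFA` entry with PyPDF2. The helper names (`xfa_packets`, `leaf_values`, `read_xfa`) are made up for this example, and it assumes `/XFA` is either a single stream or an array of alternating packet names and streams, and that the filled-in values live in a packet named `datasets`. I have not run it against the linked form, so treat it as a starting point. Newer releases may also expose this directly (for example as a `PdfReader.xfa` property); that is worth checking against the installed version.

```python
import xml.etree.ElementTree as ET


def xfa_packets(xfa_entry):
    """Map XFA packet names to raw XML bytes.

    Assumes xfa_entry is either one stream or an array laid out as
    [name, stream, name, stream, ...]. Streams are expected to offer
    get_data() like PyPDF2 stream objects; plain bytes are accepted too.
    """
    if hasattr(xfa_entry, "get_data"):           # single-stream form of /XFA
        return {"xdp": xfa_entry.get_data()}
    items = list(xfa_entry)
    packets = {}
    for name, stream in zip(items[0::2], items[1::2]):
        if hasattr(stream, "get_object"):        # resolve indirect references
            stream = stream.get_object()
        if hasattr(stream, "get_data"):
            packets[str(name)] = stream.get_data()
        else:
            packets[str(name)] = bytes(stream)
    return packets


def leaf_values(xml_bytes):
    """Return {tag: text} for every leaf element that has non-empty text."""
    values = {}
    for elem in ET.fromstring(xml_bytes).iter():
        if len(elem) == 0 and elem.text and elem.text.strip():
            tag = elem.tag.split("}")[-1]        # drop the XML namespace
            values[tag] = elem.text.strip()
    return values


def read_xfa(path):
    """Return the XFA packets of a PDF, or {} if it has no /XFA entry."""
    from PyPDF2 import PdfReader                 # lazy import: helpers above need only stdlib

    root = PdfReader(path).trailer["/Root"]
    if "/AcroForm" not in root:
        return {}
    acro_form = root["/AcroForm"]
    if "/XFA" not in acro_form:
        return {}
    return xfa_packets(acro_form["/XFA"])
```

Usage would look something like `packets = read_xfa("DA Form 638.pdf")`, then `print(list(packets))` to see which packets exist, and `leaf_values(packets["datasets"])` to inspect the field data if that packet is present. Note that repeated tag names overwrite each other in `leaf_values`, so a real form with repeating sections would need a richer structure than a flat dict.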