
How to identify if a PDF file is a PDF/A?


I have a series of PDFs on disk, some of which are PDF/A files. I would like to identify which ones are PDF/A, and potentially extract any PDF/A-specific metadata.

I am using PowerShell to pull the metadata into a CSV, but I am not sure what a good approach is to extract the info of interest. I found PDFtk and ExifTool, which look like they will work, but was wondering if PowerShell is enough on its own?

I don't want to build a PDF reader, and am happy to lean on third-party tools if PowerShell is a bad fit. I don't have any info to help me trim down the search space, so I will have to process all the PDFs. I have approx. 0.5 TB to process, with PDF sizes ranging from kB to GB and an average size of around 10 MB.


Solution

  • A PDF does not generally include the text "PDF/A" in its metadata unless the writing application placed such a string there.

    Several PDF tools try to find tags in the Info dictionary or the XMP packet, but often cannot read the XML because the PDF content is compressed or encrypted.

    The metadata indicator for PDF/A is the pair "pdfaid:part" (the PDF/A part number) and "pdfaid:conformance" (note: not "Conformance Level") in the XMP. ExifTool can look up the tags for those keywords.

    For a ZUGFeRD invoice that claims PDF/A-3A, it could be:
    pdfaid:conformance="A" pdfaid:part="3"

    exiftool -a -U -g1  test_pdfa3a.pdf |FindStr "Part"
    

    should return

    Part                            : 3
    

    and with |FindStr "Conformance" instead it should return

    Conformance                     : A
    

    Note that this is not a guarantee. The tags only show intent, which can then be tested for conformance at that level.

    To then test a folder of such candidates, we can run a verifier such as the veraPDF consortium's veraPDF checker.

    For a simple test from your own script, call verapdf.bat as a console command and pass it a folder or wildcard path.
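If ExifTool is not available, the same two tags can sometimes be found with a plain byte scan. The following is a minimal Python sketch, not a PDF parser: it assumes the XMP packet is stored uncompressed, unencrypted and in UTF-8, and the names `claim_from_bytes` and `pdfa_claim` are made up for illustration. A hit is only a claim of PDF/A, and a miss does not prove the file is not PDF/A.

```python
import mmap
import os
import re

# XMP can serialize the values as attributes (pdfaid:part="3")
# or as elements (<pdfaid:part>3</pdfaid:part>); accept both forms.
PART_RE = re.compile(rb'pdfaid:part(?:\s*=\s*["\']|\s*>)\s*([0-9]+)')
CONF_RE = re.compile(rb'pdfaid:conformance(?:\s*=\s*["\']|\s*>)\s*([A-Za-z]+)')


def claim_from_bytes(data):
    """Return (part, conformance) if a pdfaid:part value is visible, else None."""
    part = PART_RE.search(data)
    if part is None:
        return None
    conf = CONF_RE.search(data)
    return (
        part.group(1).decode("ascii"),
        conf.group(1).decode("ascii") if conf else None,
    )


def pdfa_claim(path):
    """Scan a file on disk via mmap so GB-sized PDFs are not read into memory."""
    if os.path.getsize(path) == 0:
        return None  # an empty file cannot be memory-mapped
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return claim_from_bytes(view)
```

Calling `pdfa_claim` on each file and writing the returned pair to a CSV gives a cheap first pass; files that return `None` may still carry the tags inside a compressed or encrypted stream, which is where ExifTool or a verifier is needed.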

    verapdf.bat --loglevel 0 --format text ..\..\*.pdf | find "PASS"
    

    Result should be something like:

    PASS C:\Users\Verify\Vera\rel\..\..\ave-pdf-verified1b.pdf
    PASS C:\Users\Verify\Vera\rel\..\..\deft-verified3b.pdf
    PASS C:\Users\Verify\Vera\rel\..\..\pdf-24-verified2b.pdf
    PASS C:\Users\Verify\Vera\rel\..\..\pdf-creator-verified_pdfA1b no icc.pdf
    

    NOTE: veraPDF does not change the filenames, and when run like this it does not show the conformance level. I named those files myself to keep track of which ones should be verifiable.

    Without | find "PASS" there would be other results, such as:

    C:\Users\Verify\Vera\rel\..\..\bad Aspose-Repair_verified1b.pdf does not appear to be a valid PDF file and could not be parsed.
    C:\Users\Verify\Vera\rel\..\..\bad documentize-verified1b.pdf does not appear to be a valid PDF file and could not be parsed.
    C:\Users\Verify\Vera\rel\..\..\bad xodo-verified1b.pdf does not appear to be a valid PDF file and could not be parsed.
    FAIL C:\Users\Verify\Vera\rel\..\..\barcode.pdf
    FAIL C:\Users\Verify\Vera\rel\..\..\Boris-Doubrov-WCAG-or-PDF.pdf
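Since the goal is a CSV, the text report can be sorted into statuses instead of being filtered with find. The sketch below is in Python and assumes the report lines look exactly like the sample output shown above (a `PASS ` or `FAIL ` prefix, or the "could not be parsed" suffix); other veraPDF versions or formats may differ. The names `classify_line` and `report_to_csv` and the `UNPARSED` label are invented for this example.

```python
import csv

# Suffix taken from the sample output above; treated here as an assumption.
UNPARSED_SUFFIX = " does not appear to be a valid PDF file and could not be parsed."


def classify_line(line):
    """Map one report line to (status, path), or None if it is not recognised."""
    line = line.strip()
    for status in ("PASS", "FAIL"):
        if line.startswith(status + " "):
            # Everything after the first space is the path, spaces included.
            return status, line[len(status) + 1:]
    if line.endswith(UNPARSED_SUFFIX):
        return "UNPARSED", line[: -len(UNPARSED_SUFFIX)]
    return None


def report_to_csv(lines, out):
    """Write status,path rows to the file-like `out`; return a count per status."""
    writer = csv.writer(out)
    writer.writerow(["status", "path"])
    tally = {}
    for line in lines:
        hit = classify_line(line)
        if hit is None:
            continue  # skip blank lines and anything unexpected
        writer.writerow(hit)
        tally[hit[0]] = tally.get(hit[0], 0) + 1
    return tally
```

One way to use it is to redirect the verapdf.bat output to a text file, then pass the open file to `report_to_csv` together with a CSV file opened with `newline=""`.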