Tags: pdf, pdfbox, apache-fop, pdfa, tagged-pdf

Find tagged content in PDF/A-1a using pdfbox


I have what I presume to be a PDF/A-1a file. It was generated by Apache FOP, and a letterhead was overlaid on it using OverlayPDF from PDFBox. preflight accepts the file as OK (but obviously only as PDF/A-1b), and Adobe Reader reports "PDF/A" mode and "Tagged: yes" in the document properties. I would like to see what the tagging looks like so that I could perhaps tweak FOP for some small improvements.

My question is: where can I look to see the tagged content (i.e. the text representation of what, in the PDF, is a kerned sequence of character outputs), preferably without writing code myself, e.g. using the debugger/PDFReader from PDFBox? I'm a little lost there. Alternatively, is there a way to get a textual dump of the document structure, e.g. into an XML file, so that I can search it with an editor? TIA!

Edit

The letterheads themselves are originally PostScript and were converted to PDF/A-1b using Ghostscript, then overlaid with

java -jar pdfbox-app-2.0.0-RC3.jar OverlayPDF letter_plain.pdf \
   followingpages_letterhead.pdf -first firstpage_letterhead.pdf \
   letter_with_head.pdf

The letter_plain.pdf is generated with fop using

fop -pdfprofile 'PDF/A-1a' -v -d -c my_fop_config.cfg -xml letter.xml \
   -xsl letter_to_fo.xsl -pdf letter_plain.pdf

The versions used are pdfbox 2.0 and fop 1.1.

If letter_with_head.pdf is no longer PDF/A-1a after the overlay, the question applies to letter_plain.pdf instead, which should be 1a as per the fop call. In that case I would have to choose a different solution (such as SVG) to get the letterhead in.

Edit 2

Example pdfs can be found here: https://www.magentacloud.de/share/j9qk7jfzyv - there is no need for a separate followingpages_letterhead.pdf as the sample is only one page.

Edit 3

I suspect that the text is buried somewhere below Root/StructTreeRoot/ParentTree/Nums/[1]/[3]/P/P/P/P/P/P (assuming that the P's somehow map to the fo:block's), but I can't get anywhere with showing text from the PDF.


Solution

  • The structure tree entries in the PDF at hand map to marked content in the page's content stream. As an example, the entry at

    Root/StructTreeRoot/K/[0]/K/[0]/K/[1]/K/[0]/K/[0]/K/[0]/K/[0]
    

    maps to this part of the page's content stream:

    /Span << /MCID 0 >> BDC
      BT
        /F15 11 Tf
        1 0 0 -1 0 9.163 Tm
        [ (Bes) 15 (tell-Nr) 48 (. 1) 34 (23) 6 (456) 29 (7) 40 (8) ] TJ
      ET
    EMC
    

    As can be seen, there is no additional definition, so there is no easily displayable text other than what you get by parsing the TJ operator in this example sequence. The tagging is therefore only used to define the structure of the document, pointing to its different building blocks.

    In addition, there is some information for accessibility support, but it is limited to specifying the Lang attribute in the structure tree.
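
To make "parsing the TJ operator" a bit more concrete, here is a minimal sketch in plain Python (standard library only, not PDFBox) that collects the literal strings inside each marked-content sequence and keys them by MCID. The function names `tj_text` and `text_by_mcid` are made up for this illustration. It assumes an already decoded (uncompressed) content stream, an inline `<< /MCID n >>` property dictionary, no nested marked content, only literal `( ... )` strings without nested unescaped parentheses or octal escapes, and a font whose character codes happen to be readable as text. Subset fonts with custom encodings would need the font's encoding or ToUnicode map, which this sketch ignores.

```python
import re

# One PDF literal string "( ... )", allowing backslash escapes inside.
# Assumption: no nested unescaped parentheses and no hex strings (<...>).
_LITERAL = re.compile(r"\((?:\\.|[^\\()])*\)")
_ESCAPE = re.compile(r"\\(.)", re.DOTALL)
_SIMPLE_ESCAPES = {"n": "\n", "r": "\r", "t": "\t"}

# One marked-content sequence with an inline MCID property dictionary.
# Assumption: sequences are not nested and "EMC" does not occur inside a string.
_MARKED = re.compile(
    r"/(\w+)\s*<<\s*/MCID\s+(\d+)\s*>>\s*BDC(.*?)EMC", re.DOTALL
)


def tj_text(operands: str) -> str:
    """Concatenate all literal strings, dropping the kerning numbers between them."""
    parts = []
    for match in _LITERAL.finditer(operands):
        body = match.group(0)[1:-1]  # strip the surrounding parentheses
        # Resolve \( \) \\ and a few simple escapes; octal escapes are not handled.
        parts.append(
            _ESCAPE.sub(lambda m: _SIMPLE_ESCAPES.get(m.group(1), m.group(1)), body)
        )
    return "".join(parts)


def text_by_mcid(content_stream: str) -> dict:
    """Map each MCID to (tag name, text shown inside that marked-content sequence)."""
    result = {}
    for tag, mcid, body in _MARKED.findall(content_stream):
        result[int(mcid)] = (tag, tj_text(body))
    return result


# The content stream fragment quoted above.
SAMPLE = """
/Span << /MCID 0 >> BDC
  BT
    /F15 11 Tf
    1 0 0 -1 0 9.163 Tm
    [ (Bes) 15 (tell-Nr) 48 (. 1) 34 (23) 6 (456) 29 (7) 40 (8) ] TJ
  ET
EMC
"""

if __name__ == "__main__":
    print(text_by_mcid(SAMPLE))
```

For the fragment above this yields `Bestell-Nr. 12345678` for MCID 0, i.e. the kerned pieces joined back together. The structure tree entry then only tells you that this piece of page content is a Span at that position in the document hierarchy.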