
How to heal inconsistent parent tree mappings in a PDF created by pdfBox


We are creating PDF documents in Java using pdfBox. Since they should be accessible to screen readers, we tag the content, set up a parentTree, and add it to the document catalog.

Please find an example file here.

When we check the resulting PDF with the PAC3 validator, we get 25 errors for inconsistent entries in the structural parent tree.


The Adobe Preflight syntax error check gives the same result but with more details. The error message is

Inconsistent ParentTree mapping (ParentTree element 0) for structure element 
Traversal Path:->StructTreeRoot->K->K->[1]->K->[3]->K->[4]

[Image: Adobe preflight syntax error check]

When I try to follow that traversal path in the pdfBox Debugger, I see an element referencing the ID 22.

Now my questions are:

  1. What is the connection between the StructTreeRoot and the ParentTree?
  2. Where in the StructTreeRoot/ParentTree can I find the item with ID 22 that is referred to in node K->K->2->K->4->K->4? See image PDF Debugger
  3. What is that Parent Tree element 0 in the Preflight error message? See image Adobe preflight syntax error check

[Image: PDF Debugger]

I think that both building accessible PDFs with pdfBox and the error messages of common validation tools are rather poorly documented. Where can I find more information about them?

Thanks a lot for your help.


Solution

  • The issue in your PDF is very reminiscent of the one discussed in the last section, "Yet another issue with parent tree entries", of this answer to the question “Find Tag from Selection” is not working in tagged pdf? by fascinating coder:

    In your parent tree you do not reference the actual parent structure element of the MCID but you reference a new structure tree node which claims to have the actual parent node from the structure hierarchy as its own parent (not actually being one of its kids) and also claims to have the MCID in question as kid.

    Instead you should simply reference the actual parent structure element of the MCID.
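    To make the rule above concrete: for a page, the parent tree value is an array indexed by MCID, and the entry at index i should be the structure element that actually has MCID i as a kid. The following is only a minimal sketch in plain Java, not PDFBox code; the `Elem` class and `isConsistent` method are hypothetical stand-ins invented for illustration, and the check is a simplification of what validators do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParentTreeCheck {
    /** Hypothetical stand-in for a structure element: parent link, child elements, MCID kids. */
    static final class Elem {
        final String name;
        Elem parent;
        final List<Elem> kids = new ArrayList<>();
        final List<Integer> mcids = new ArrayList<>();

        Elem(String name) { this.name = name; }

        Elem addKid(Elem kid) { kid.parent = this; kids.add(kid); return this; }
        Elem addMcid(int mcid) { mcids.add(mcid); return this; }
    }

    /**
     * pageArray models the parent tree value of one page, indexed by MCID.
     * Entry i is treated as consistent when the element lists MCID i among its kids
     * and is really one of the kids of the element it names as its parent.
     */
    static boolean isConsistent(List<Elem> pageArray) {
        for (int mcid = 0; mcid < pageArray.size(); mcid++) {
            Elem e = pageArray.get(mcid);
            if (e == null) continue;                      // unused MCID slot
            if (!e.mcids.contains(mcid)) return false;    // element does not own this MCID
            if (e.parent != null && !e.parent.kids.contains(e)) return false; // not in the hierarchy
        }
        return true;
    }

    public static void main(String[] args) {
        Elem root = new Elem("Document");
        Elem p = new Elem("P").addMcid(0);
        root.addKid(p);

        // The faulty pattern described above: a fresh node that names the real
        // parent as its own parent but is not among that parent's kids.
        Elem stray = new Elem("P").addMcid(0);
        stray.parent = root;

        System.out.println(isConsistent(Arrays.asList(p)));     // true
        System.out.println(isConsistent(Arrays.asList(stray))); // false
    }
}
```

    In this model the fix is the same as in the PDF: put `p`, the element already present in the structure hierarchy, into the array instead of creating `stray`.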

    As your question title asks how to heal inconsistent parent tree mappings in a PDF created by pdfBox, here is an approach to fix your parent tree by rebuilding it from the structure tree.

    First recursively collect MCIDs and their parent structure tree elements by page, e.g. using a method like this:

    void collect(PDPage page, PDStructureNode node, Map<PDPage, Map<Integer, PDStructureNode>> parentsByPage) {
        COSDictionary pageDictionary = node.getCOSObject().getCOSDictionary(COSName.PG);
        if (pageDictionary != null) {
            page = new PDPage(pageDictionary);
        }
    
        for (Object object : node.getKids()) {
            if (object instanceof COSArray) {
                for (COSBase base : (COSArray) object) {
                    if (base instanceof COSDictionary) {
                        collect(page, PDStructureNode.create((COSDictionary) base), parentsByPage);
                    } else if (base instanceof COSNumber) {
                        setParent(page, node, ((COSNumber)base).intValue(), parentsByPage);
                    } else {
                        System.out.printf("?%s\n", base);
                    }
                }
            } else if (object instanceof PDStructureNode) {
                collect(page, (PDStructureNode) object, parentsByPage);
            } else if (object instanceof Integer) {
                setParent(page, node, (Integer)object, parentsByPage);
            } else {
                System.out.printf("?%s\n", object);
            }
        }
    }
    

    (RebuildParentTreeFromStructure method)

    with this helper method

    void setParent(PDPage page, PDStructureNode node, int mcid, Map<PDPage, Map<Integer, PDStructureNode>> parentsByPage) {
        if (node == null) {
            System.err.printf("Cannot set null as parent of MCID %s.\n", mcid);
        } else if (page == null) {
            System.err.printf("Cannot set parent of MCID %s for null page.\n", mcid);
        } else {
            Map<Integer, PDStructureNode> parents = parentsByPage.get(page);
            if (parents == null) {
                parents = new HashMap<>();
                parentsByPage.put(page, parents);
            }
            if (parents.containsKey(mcid)) {
                System.err.printf("MCID %s already has a parent. New parent rejected.\n", mcid);
            } else {
                parents.put(mcid, node);
            }
        }
    }
    

    (RebuildParentTreeFromStructure helper method)

    and then rebuild based on the collected information:

    void rebuildParentTreeFromData(PDStructureTreeRoot root, Map<PDPage, Map<Integer, PDStructureNode>> parentsByPage) {
        int parentTreeMaxkey = -1;
        Map<Integer, COSArray> numbers = new HashMap<>();
    
        for (Map.Entry<PDPage, Map<Integer, PDStructureNode>> entry : parentsByPage.entrySet()) {
            int parentsId = entry.getKey().getCOSObject().getInt(COSName.STRUCT_PARENTS);
            if (parentsId < 0) {
                System.err.printf("Page without StructParents. Ignoring %s MCIDs.\n", entry.getValue().size());
            } else {
                if (parentTreeMaxkey < parentsId)
                    parentTreeMaxkey = parentsId;
                COSArray array = new COSArray();
                for (Map.Entry<Integer, PDStructureNode> subEntry : entry.getValue().entrySet()) {
                    array.growToSize(subEntry.getKey() + 1);
                    array.set(subEntry.getKey(), subEntry.getValue());
                }
                numbers.put(parentsId, array);
            }
        }
    
        PDNumberTreeNode numberTreeNode = new PDNumberTreeNode(PDParentTreeValue.class);
        numberTreeNode.setNumbers(numbers);
        root.setParentTree(numberTreeNode);
        root.setParentTreeNextKey(parentTreeMaxkey + 1);
    }
    

    (RebuildParentTreeFromStructure method)
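    The data shape that the rebuild method produces can be looked at in isolation. The sketch below is plain Java without PDFBox: the class `ParentTreeModel` and its methods are hypothetical, strings stand in for structure elements, and a `TreeMap` stands in for the number tree. It shows one array per StructParents key, indexed by MCID with `null` in unused slots, and a next key that is one past the largest key in use, as in the code above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ParentTreeModel {
    /** Turns {MCID -> parent name} into a list indexed by MCID, padding unused slots with null. */
    static List<String> toMcidArray(Map<Integer, String> parentsByMcid) {
        List<String> array = new ArrayList<>();
        for (Map.Entry<Integer, String> e : parentsByMcid.entrySet()) {
            while (array.size() <= e.getKey()) {
                array.add(null);                    // grow so that index == MCID is valid
            }
            array.set(e.getKey(), e.getValue());
        }
        return array;
    }

    /** Builds the number tree model: StructParents key -> MCID-indexed array. */
    static TreeMap<Integer, List<String>> build(Map<Integer, Map<Integer, String>> byStructParents) {
        TreeMap<Integer, List<String>> numbers = new TreeMap<>();
        for (Map.Entry<Integer, Map<Integer, String>> e : byStructParents.entrySet()) {
            numbers.put(e.getKey(), toMcidArray(e.getValue()));
        }
        return numbers;
    }

    /** Next key: one past the largest key in use, or 0 for an empty tree. */
    static int nextKey(TreeMap<Integer, List<String>> numbers) {
        return numbers.isEmpty() ? 0 : numbers.lastKey() + 1;
    }

    public static void main(String[] args) {
        Map<Integer, String> page0 = new HashMap<>();
        page0.put(0, "H1");
        page0.put(2, "P");                          // MCID 1 is unused on this page
        Map<Integer, Map<Integer, String>> input = new HashMap<>();
        input.put(0, page0);

        TreeMap<Integer, List<String>> numbers = build(input);
        System.out.println(numbers);                // {0=[H1, null, P]}
        System.out.println(nextKey(numbers));       // 1
    }
}
```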

    Applied like this

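    PDDocument document = PDDocument.load(SOURCE);
    // rebuildParentTree is not shown above; presumably it calls collect(...) on the
    // structure tree root and then rebuildParentTreeFromData(...) with the result.
    rebuildParentTree(document);
    document.save(RESULT);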
    

    (RebuildParentTreeFromStructure test testTestdatei)

    PAC3 and Adobe Preflight (at least of my old Acrobat 9.5) go all green for the result:

    [Image: PAC3 screenshot]

    [Image: Adobe Preflight screenshot]

    Beware: This is no generic parent tree rebuilder yet. It is made to work for the test file at hand with a specific kind of structure tree nodes and content only in page content streams. For a generic tool it has to learn to cope with other kinds, too, and to also process e.g. marked content in embedded XObjects.