Merging two PDF/A files with PDFBox: the result should be a valid PDF/A too


I'm using PDFBox to merge two PDF/A files.

Right now my code looks like this:

    PDFMergerUtility mergerUtility = new PDFMergerUtility();

    File file = new File("example/c.pdf");

    mergerUtility.addSource(new File("example/a.pdf"));
    mergerUtility.addSource(new File("example/b.pdf"));

    mergerUtility.setDestinationFileName(file.getAbsolutePath());

    try {
        mergerUtility.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
    } catch (IOException ex) {
        throw new RuntimeException("Unable to merge", ex);
    }

    File inputFile = new File("example/c.pdf");
    PDDocument doc = PDDocument.load(inputFile);

    File orig = new File("example/a.pdf");
    PDDocument origDoc = PDDocument.load(orig);
    File orig2 = new File("example/b.pdf");
    PDDocument orig2Doc = PDDocument.load(orig2);
    PDStructureTreeRoot treeRoot = origDoc.getDocumentCatalog().getStructureTreeRoot();
    PDStructureTreeRoot treeRoot2 = orig2Doc.getDocumentCatalog().getStructureTreeRoot();
    treeRoot.setKids(treeRoot2.getKids());
    doc.getDocumentCatalog().setStructureTreeRoot(treeRoot);

    List<PDOutputIntent> outputIntents=new ArrayList<>();
    outputIntents.add(doc.getDocumentCatalog().getOutputIntents().get(0));
    doc.getDocumentCatalog().setOutputIntents(outputIntents);
    doc.save("example/d.pdf");
    doc.close();

By keeping only the first OutputIntent (so that d.pdf has exactly one), I already solved a lot of problems. The last remaining problem, reported only by this validator (https://avepdf.com/pdfa-validation), occurs 4 times:

"Non-standard structure type is not mapped to any functionally equivalent standard type."

I was able to track the errors down: they are caused by the "THead" and "TBody" structure types in my result (generated by PDFBox).

I was able to cut the errors in half (now I get only 2 of them) by reusing the StructureTreeRoot of one of the original files. That is the "final" code you can see above, but I still cannot get a valid PDF/A from merging two PDF/A files. I don't know how to merge those two StructureTreeRoot objects, or whether that is even the real solution (maybe there is a way to tell PDFBox not to use THead and TBody at all).

The result is already good, as it passes most PDF/A validators out there. I just need it to pass this validator too, since it is the one used by the company I work for. I also don't think the validator is to blame, since both input files pass it as valid PDF/A.

Got any ideas?

PS. I found a way to merge the two PDStructureTreeRoot objects. It is still not working, but I updated the code above by adding:

    PDStructureTreeRoot treeRoot2 = orig2Doc.getDocumentCatalog().getStructureTreeRoot();
    treeRoot.setKids(treeRoot2.getKids());

Solution

  • OK, this was really tough. For the first time I had to get help from ChatGPT, and even that was not enough: the AI could not produce large amounts of working code, and I had to correct it every time. It was still useful for turning what I had in mind into code samples (a kind of pseudocode). The most important thing I had to understand was that my problem could not be solved by cloning the StructureTreeRoot (the approach in my edit), so I deleted that part of my code. I never figured out how to prevent THead and TBody from being created, so the only thing I could do was replace every one of them with a standard alternative such as P.

    The first problem was how to get hold of the root in a way that let me iterate over it. I started by debugging my tree and walking it manually until I reached a THead. I got some direction from the structure I could read by opening the generated PDF in Visual Studio Code. The AI helped a little here with accessing some fields that were protected and that I could otherwise only inspect in debug mode.

    COSDictionary catalogDict = doc2.getDocumentCatalog().getCOSObject();
    COSObject structTreeRootRef = (COSObject) catalogDict.getItem(COSName.STRUCT_TREE_ROOT);
    COSDictionary structTreeRootDict = (COSDictionary) structTreeRootRef.getObject();
    COSBase result = structTreeRootDict.getItem(COSName.K);
    COSDictionary dict1 = result instanceof COSObject ? (COSDictionary) ((COSObject) result).getObject() : null;
    COSArray array;
    if (dict1 == null) {
        // this: /K of the root is the array of kids itself
        array = result instanceof COSArray ? (COSArray) result : new COSArray();
    } else {
        // or: /K of the root references a single element whose own /K holds the array
        COSBase result1 = dict1.getItem(COSName.K);
        array = result1 instanceof COSArray ? (COSArray) result1 : new COSArray();
    }
    

    I no longer have the original code, but after this first part I simply called get(i) with hard-coded indexes, because I knew exactly which nodes I needed, and finally:

    subDict.setItem(COSName.S, COSName.getPDFName("P"));
    

    That was already working code, which was great, since getting to that point meant learning how to navigate the PDF tree, and PDFBox is not intuitive at all in that respect. But I was not done, because my solution only worked for the two PDF/A files of my own example. So I turned the get(i) calls into for loops. That was better, but still not a great solution: when I tried another PDF where THead and TBody were one level deeper, I had to add another nested for loop to make it work again, and the performance was not great either. That's where ChatGPT helped again by suggesting a recursive alternative (the solution was trivial, but honestly it had not occurred to me at all). I still had to correct the code a lot, but in the end the working solution was this:

        COSDictionary catalogDict = doc2.getDocumentCatalog().getCOSObject();
        COSObject structTreeRootRef = (COSObject) catalogDict.getItem(COSName.STRUCT_TREE_ROOT);
        COSDictionary structTreeRootDict = (COSDictionary) structTreeRootRef.getObject();
        COSName newName = COSName.getPDFName("P");
    
        updateStructureTree(structTreeRootDict, newName);
    

    The recursive method:

    private static void updateStructureTree(COSDictionary dict, COSName newName) {
        COSBase kids = dict.getItem(COSName.K);

        // If /K is an indirect reference to a single element, iterate over that
        // element's own /K instead (the element itself is not checked).
        if (kids instanceof COSObject) {
            COSBase target = ((COSObject) kids).getObject();
            kids = target instanceof COSDictionary ? ((COSDictionary) target).getItem(COSName.K) : null;
        }
        if (!(kids instanceof COSArray)) {
            return; // no array of kids: nothing to iterate over
        }

        for (COSBase item : (COSArray) kids) {
            COSBase resolved = item instanceof COSObject ? ((COSObject) item).getObject() : null;
            if (!(resolved instanceof COSDictionary)) {
                continue; // only indirectly referenced dictionaries are visited
            }
            COSDictionary subDict = (COSDictionary) resolved;

            // Replace the two structure types that the validator rejects
            COSBase type = subDict.getItem(COSName.S);
            if (COSName.getPDFName("THead").equals(type) || COSName.getPDFName("TBody").equals(type)) {
                subDict.setItem(COSName.S, newName);
            }

            // Descend, so nesting depth no longer matters
            updateStructureTree(subDict, newName);
        }
    }
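
    The recursion itself can be checked in isolation from PDFBox. What follows is only a minimal sketch on a toy model, not PDFBox code: StructNode and StructTypeRenamer are hypothetical names invented for this illustration, with type standing in for /S and kids for /K. It shows that a single recursive call handles THead and TBody at any nesting depth, which is what the nested for loops could not do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only (not part of PDFBox).
final class StructTypeRenamer {

    /** Renames every node whose type is in `from` to `to`; returns how many nodes changed. */
    static int rename(StructNode node, Set<String> from, String to) {
        int changed = 0;
        if (node.type != null && from.contains(node.type)) {
            node.type = to;
            changed++;
        }
        // Recurse into the kids, whatever their depth.
        for (StructNode kid : node.kids) {
            changed += rename(kid, from, to);
        }
        return changed;
    }

    public static void main(String[] args) {
        // THead sits two levels below the root, TBody three levels below it.
        StructNode root = new StructNode("Document",
                new StructNode("Table",
                        new StructNode("THead", new StructNode("TR")),
                        new StructNode("Sect",
                                new StructNode("TBody", new StructNode("TR")))));

        int changed = rename(root, Set.of("THead", "TBody"), "P");
        System.out.println(changed + " node(s) renamed"); // prints: 2 node(s) renamed
    }
}

// Toy stand-in for a structure element: `type` models /S, `kids` models /K.
final class StructNode {
    String type;
    final List<StructNode> kids = new ArrayList<>();

    StructNode(String type, StructNode... children) {
        this.type = type;
        for (StructNode child : children) {
            kids.add(child);
        }
    }
}
```

    Returning the number of renamed nodes is a design choice of this sketch; it makes it easy to compare the count against the number of errors the validator reported.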
    

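    PS: I found an alternative. Put the input files in a List (lists below) and append every file after the first one to the target document (docC below):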

    mergerUtility.setDocumentMergeMode(PDFMergerUtility.DocumentMergeMode.OPTIMIZE_RESOURCES_MODE);
    for (int i = 1; i < lists.size(); i++) {
        PDDocument currentDoc = PDDocument.load(lists.get(i));
    
        mergerUtility.appendDocument(docC, currentDoc);
    }
    

    and then save the resulting document. Using appendDocument instead of mergeDocuments does not invalidate the PDF/A even if it contains THead and TBody, and it does not change the header version either.