
Merging PDFs in a certain order fails with "Tag structure copying failed"


I tried to merge multiple PDFs in a specific order, using this general approach:

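import com.itextpdf.kernel.pdf.PdfDocument
import com.itextpdf.kernel.pdf.PdfReader
import com.itextpdf.kernel.pdf.PdfWriter
import com.itextpdf.kernel.utils.PdfMerger
import java.io.File

// filesToMergeList holds the files in the order in which they will be merged
val firstFileInMergeList: File = filesToMergeList[0]

// The first file is opened with a reader and a writer and becomes the merge target
val pdfReader = PdfReader(firstFileInMergeList)
val pdfWriter = PdfWriter(destinationFile)
val pdfDocument = PdfDocument(pdfReader, pdfWriter)

val merger = PdfMerger(pdfDocument)

filesToMergeList.forEachIndexed { index, file ->
    if (index > 0) {
        val pdfDocument2 = PdfDocument(PdfReader(file))

        merger.merge(pdfDocument2, 1, pdfDocument2.numberOfPages)

        // Flush the copied objects and close the source to free memory
        pdfDocument.flushCopiedObjects(pdfDocument2)
        pdfDocument2.close()
    }
}

pdfDocument.close()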

This approach worked well for most documents, but I ran into an issue with iText 7. When I first merged the documents in this zip in the following order:

B.Tech-Cyber-180-Credits.docx.pdf
invoice.pdf
10840.pdf

the files were merged successfully.

On the second try, I changed the merge order to:

10840.pdf
invoice.pdf
B.Tech-Cyber-180-Credits.docx.pdf

which gave me this error:

com.itextpdf.kernel.exceptions.PdfException: Tag structure copying failed: it might be corrupted in one of the documents.

Stacktrace:

W/System.err: com.itextpdf.kernel.exceptions.PdfException: Tag structure copying failed: it might be corrupted in one of the documents.
W/System.err:     at com.itextpdf.kernel.pdf.PdfDocument.copyPagesTo(PdfDocument.java:1316)
W/System.err:     at com.itextpdf.kernel.pdf.PdfDocument.copyPagesTo(PdfDocument.java:1366)
W/System.err:     at com.itextpdf.kernel.pdf.PdfDocument.copyPagesTo(PdfDocument.java:1345)
W/System.err:     at com.itextpdf.kernel.utils.PdfMerger.merge(PdfMerger.java:140)
W/System.err:     at com.itextpdf.kernel.utils.PdfMerger.merge(PdfMerger.java:117)
W/System.err:     at com.example.jetpack_compose_pick_edit_save_pdf_itext7_example.MergePDFsKt$mergePDFs$2.invokeSuspend(MergePDFs.kt:255)
W/System.err:     at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
W/System.err:     at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
W/System.err:     at kotlinx.coroutines.internal.LimitedDispatcher.run(LimitedDispatcher.kt:42)
W/System.err:     at kotlinx.coroutines.scheduling.TaskImpl.run(Tasks.kt:95)
W/System.err:     at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:570)
W/System.err:     at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:749)
W/System.err:     at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:677)
W/System.err:     at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:664)
W/System.err: Caused by: java.lang.NullPointerException: Attempt to invoke virtual method 'com.itextpdf.kernel.pdf.PdfName com.itextpdf.kernel.pdf.tagging.PdfStructElem.getRole()' on a null object reference
W/System.err:     at com.itextpdf.kernel.pdf.tagutils.TagStructureContext.normalizeDocumentRootTag(TagStructureContext.java:479)
W/System.err:     at com.itextpdf.kernel.pdf.PdfDocument.copyPagesTo(PdfDocument.java:1314)
W/System.err:   ... 13 more

Can anyone explain the reason behind this behavior and how it can be avoided?

Also, if this is a bug in iText 7, how and where can I report it?


Solution

  • The problem arises because your input documents are a mixture of tagged and untagged documents: 10840.pdf is not tagged, while the other two are.

    By default, PdfMerger attempts to keep the structure information from tagged PDFs, but it does some sanity checks. These checks require parts of the structure information of the first tagged source in the merge to be kept in memory for comparisons.

    In your code, though, if filesToMergeList[0] is not tagged but later source files are, the structure information of the first tagged merge source is flushed in your loop and therefore removed from memory. This is what happens in your failing case, resulting in a NullPointerException when iText tries to access that flushed information.

    Three workarounds come to mind:
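
    To see which merge orders match the failing pattern described above, one could check each input's tagged status up front. The following is only a sketch: the helper names isTaggedPdf, untaggedFirstTaggedLater, and hasRiskyMergeOrder are made up for illustration, and the check only mirrors the condition stated in this answer (first source untagged, a later source tagged).

    ```kotlin
    import com.itextpdf.kernel.pdf.PdfDocument
    import com.itextpdf.kernel.pdf.PdfReader
    import java.io.File

    // Hypothetical helper: opens the PDF read-only and reports whether it is tagged.
    fun isTaggedPdf(file: File): Boolean =
        PdfDocument(PdfReader(file)).use { it.isTagged }

    // Hypothetical helper: true when the first input is untagged
    // but at least one later input is tagged.
    fun untaggedFirstTaggedLater(taggedFlags: List<Boolean>): Boolean =
        taggedFlags.isNotEmpty() && !taggedFlags.first() && taggedFlags.drop(1).any { it }

    // Hypothetical helper: applies the check to the files in their merge order.
    fun hasRiskyMergeOrder(files: List<File>): Boolean =
        untaggedFirstTaggedLater(files.map(::isTaggedPdf))
    ```

    With the files from the question, the working order corresponds to the flags (tagged, tagged, untagged) and the failing order to (untagged, tagged, tagged), so only the second order would be reported as risky.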

    1. Don't flush and close the documents to merge; keep them in memory instead.

      This of course requires additional memory, which I assume you wanted to avoid.

    2. Make sure you merge only properly tagged or only completely untagged documents; or at least make sure that, if any of the documents to merge is tagged, the first document in your list is tagged.

      This may of course conflict with your task of merging arbitrary collections of documents in an arbitrary order.

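    3. Make PdfMerger ignore tagging information. You can do that by replacing

      val merger = PdfMerger(pdfDocument)

      with

      val merger = PdfMerger(pdfDocument, false, true)

      Here the second argument (mergeTags) is false, so tags are not copied, and the third (mergeOutlines) is true, so outlines are still merged.

      Keeping tagging information from a mixed collection of tagged and untagged sources is of limited use anyway.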

    4. Having seen these three options, you came up with a fourth one: use an additional tagged PDF (e.g. an empty, single-page one you can easily create with iText) as the first source PDF, and remove its pages after merging.
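
    As an illustration of workaround 1, the loop from the question could be restructured roughly as follows. This is a minimal sketch, not a tested fix: the function name mergeKeepingSourcesOpen is hypothetical, and the code simply omits flushCopiedObjects and postpones closing the sources until the target document has been closed.

    ```kotlin
    import com.itextpdf.kernel.pdf.PdfDocument
    import com.itextpdf.kernel.pdf.PdfReader
    import com.itextpdf.kernel.pdf.PdfWriter
    import com.itextpdf.kernel.utils.PdfMerger
    import java.io.File

    // Hypothetical function: merges the files in list order while keeping
    // every source document open (and unflushed) until the target is closed.
    fun mergeKeepingSourcesOpen(filesToMergeList: List<File>, destinationFile: File) {
        val pdfDocument = PdfDocument(PdfReader(filesToMergeList[0]), PdfWriter(destinationFile))
        val merger = PdfMerger(pdfDocument)
        val openSources = mutableListOf<PdfDocument>()
        try {
            for (file in filesToMergeList.drop(1)) {
                val source = PdfDocument(PdfReader(file))
                openSources.add(source)
                // No flushCopiedObjects and no close here: the source stays in memory.
                merger.merge(source, 1, source.numberOfPages)
            }
        } finally {
            try {
                // Close the target first, so it is written while the sources are still open.
                pdfDocument.close()
            } finally {
                openSources.forEach { it.close() }
            }
        }
    }
    ```

    The trade-off is the one named in workaround 1: every source stays in memory until the end, so memory use grows with the number and size of the inputs.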