Merge annotations of PDF files using Python `borb` library


First create a simple document:

from borb.pdf import Document
from borb.pdf import Page
from borb.pdf import SingleColumnLayout
from borb.pdf import Paragraph
from borb.pdf import PDF

# create an empty Document
pdf = Document()

# add an empty Page
page = Page()
pdf.add_page(page)

# use a PageLayout (SingleColumnLayout in this case)
layout = SingleColumnLayout(page)

# add a Paragraph object
layout.add(Paragraph("Hello World!"))

# store the PDF
with open("output.pdf", "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, pdf)

Now I annotate it with two different programs, just for testing.

  1. Highlight the word "Hello" in yellow using the Evince PDF reader 3.36.10 on Linux
  2. Underline in red using the Firefox built-in PDF reader

Result looks like this:

[screenshot of the annotated PDF]

Then I try this to extract annotations:

from borb.pdf import Document
from borb.pdf import PDF

# Read the document
doc = None
with open("output.pdf", "rb") as in_file_handle:
    doc = PDF.loads(in_file_handle)

for i in range(0,30):
    annots = doc.get_page(i).get_annotations()
    print(">>> ", annots)

Here are test files and test results:

  1. output.pdf: plain, no annotations
  2. output1.pdf: annotated by Evince, but the annotations are not found
  3. output2.pdf: annotated by Firefox, but page(0) raises an index out of range error
  4. output3.pdf: annotated by Firefox and a page added with PDFTK, now the annotations can be read
  5. test-paper.pdf: a real paper I downloaded and annotated, fails to read any annotations

Solution

  • As K J already hinted at in the comments, the first issue here is that borb does not seem to add an EOL character (sequence) after the %%EOF end-of-file marker of a PDF, and evince does not add an EOL at the beginning of an incremental update.

    These two behaviors combined cause the first object of the incremental update by evince to start on the final %%EOF line by borb which effectively puts it onto a comment line.

    Most PDF processors can nonetheless read that first object of the incremental update properly, because objects in a PDF have to be located by using the cross reference tables at the respective ends of the document revisions. These tables indicate the file position at which each object starts, so a PDF processor that simply follows the cross reference table and parses from there will not be disturbed by the preceding %%EOF.

    Some PDF processors, though, don't use the cross references but simply parse the file start-to-end, so they run into an error here (which either makes them stop parsing the PDF or which they try to ignore by reading on); this is also done by most PDF processors when trying to repair PDFs with broken cross reference tables. For regular access, though, this is the wrong approach.

    There may even be PDF processors that use the position from the cross reference table but then explicitly check whether right before that there is something fishy; these parsers also will run into an error here or at least emit a warning.
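
To check whether a given file is affected by this first issue, something like the following rough sketch may help. It only scans the raw bytes for a %%EOF marker that is immediately followed by other content instead of an EOL. The function names are made up for illustration, and since it does not parse the PDF, a %%EOF inside a stream could in principle produce a false positive.

```python
import re
from typing import List

# "%%EOF" directly followed by a byte that is neither CR nor LF.
_EOF_NO_EOL = re.compile(rb"%%EOF(?=[^\r\n])")


def find_eof_without_eol(data: bytes) -> List[int]:
    """Return the byte offsets of every %%EOF marker that is directly
    followed by more content on the same line (hypothetical helper)."""
    return [m.start() for m in _EOF_NO_EOL.finditer(data)]


def ends_without_eol(data: bytes) -> bool:
    """True if the file stops right after %%EOF, with no trailing EOL."""
    return data.endswith(b"%%EOF")
```

Usage would be along the lines of `find_eof_without_eol(open("output1.pdf", "rb").read())`; a non-empty result means some revision starts on a %%EOF line.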


    The second issue is that in output2.pdf the last revision uses a cross reference stream while the preceding revisions use cross reference tables. This is incorrect; the only way to mix cross reference types is by using hybrid cross references.

    Some PDF processors may accept this mix-up, others won't.
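
A quick way to see which kind of cross reference each revision uses is to look at what every startxref offset points to: the keyword xref means a classic table, an indirect object ("n g obj") means a cross reference stream. The following is only a heuristic sketch with made-up function names. It works on the raw bytes, ignores linearization, and reports a hybrid file as "table" because there the startxref points to the table.

```python
import re
from typing import List, Tuple

# "startxref <offset> %%EOF" closes every revision of a PDF.
_STARTXREF = re.compile(rb"startxref\s+(\d+)\s+%%EOF")
# Start of an indirect object, e.g. "12 0 obj".
_INDIRECT_OBJ = re.compile(rb"\d+\s+\d+\s+obj")


def classify_xref_sections(data: bytes) -> List[Tuple[int, str]]:
    """Return (offset, kind) per revision, in file order.
    kind is "table", "stream" or "unknown" (hypothetical helper)."""
    results = []
    for m in _STARTXREF.finditer(data):
        offset = int(m.group(1))
        if data.startswith(b"xref", offset):
            kind = "table"
        elif _INDIRECT_OBJ.match(data, offset):
            kind = "stream"
        else:
            kind = "unknown"
        results.append((offset, kind))
    return results


def mixes_xref_types(data: bytes) -> bool:
    """True if both tables and streams are referenced by startxref."""
    kinds = {kind for _, kind in classify_xref_sections(data)}
    return {"table", "stream"} <= kinds
```

If `mixes_xref_types` returns True for a file, it shows the kind of mix described above for output2.pdf.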


    To help improve interoperability, you may want to ask the developers of the PDF processors in question to improve their products:

    • borb should add an EOL after the final %%EOF;
    • evince should first add an EOL when adding an incremental update, at least if there is none at the end of the previous revision;
    • firefox must stop mixing cross reference types.
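
Until borb is changed, a possible workaround on the user side (my assumption, not something borb offers) is to append the missing EOL yourself right after writing the file and before annotating it in another program. Appending a byte at the very end does not move any existing object, so the offsets in the cross reference table stay valid. The helper name below is hypothetical.

```python
from pathlib import Path


def ensure_trailing_eol(path) -> bool:
    """Append a newline if the file stops right after %%EOF.
    Returns True if the file was changed (hypothetical helper)."""
    p = Path(path)
    if not p.read_bytes().endswith(b"%%EOF"):
        return False  # already has an EOL, or is not what we expect
    with p.open("ab") as fh:
        fh.write(b"\n")
    return True
```

You would call `ensure_trailing_eol("output.pdf")` once after `PDF.dumps(...)` has finished. This only addresses the first issue, not the mixed cross reference types written by firefox.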

    This question also initiated a discussion in a PDF Association thread here: https://github.com/pdf-association/pdf-issues/issues/112