python pdf annotations extract pypdf

Extract annotations by layer from a PDF in Python


I have a PDF with annotations (markups) stored in different layers. Each layer has a specific name. I need to extract the annotations together with their layer names. In particular, I'm interested only in each annotation's location (i.e. its bounding box) and the name of its layer, so the output would look like:

{ "layerName": "myLayer01", "location" : [ 10, 5, 4, 2 ] }

Using a library like PyPDF2 (I'm on the latest version, v3.0.1), I can extract the annotations' locations like this:

from PyPDF2 import PdfReader
reader = PdfReader("myFile.pdf")

for page in reader.pages:
    if "/Annots" in page:
        for annot in page["/Annots"]:
            obj = annot.get_object()
            annotation = { "layerName": None, "location": obj["/Rect"] } # None is a placeholder: how do I get the layer name?

While it's easy to get the location, I am struggling to figure out how to get the layer name of the annotation.

If I inspect the properties of the extracted obj (for example, by serializing it entirely with jsonpickle and searching the result), I cannot find any mention of the layer the annotation is on.

I know it's possible to get a list of all existing layers with something like:

# Get the document catalog and its layers (optional content groups)
catalog = reader.trailer["/Root"]
layers = catalog["/OCProperties"]["/OCGs"]

but this doesn't help with grouping the annotations by layer.

Any suggestion is appreciated. I'd prefer a concise solution, and I'm happy to use a library other than PyPDF2 if that helps.


Solution

  • This is easy in PyMuPDF.

    import fitz  # PyMuPDF
    from pprint import pprint
    
    doc = fitz.open("input.pdf")
    for page in doc:
        for annot in page.annots():
            oc_xref = annot.get_oc()  # xref of its OCG or OCMD, 0 if none
            if oc_xref > 0:  # it indeed has an OCG/OCMD
                # get_ocgs() is keyed by OCG xref; an OCMD xref yields None here
                ocg_dict = doc.get_ocgs().get(oc_xref)
                pprint(ocg_dict)
    
    # the output would be something like this:
    {'on': True,
    'intent': ['View', 'Design'],
    'name': 'Circle',
    'usage': 'Artwork'}
    
    ...
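
To get output in the shape asked for in the question, the same PyMuPDF calls can be wrapped as in the sketch below. The helper names `annotations_by_layer` and `extract_from_file` are made up for illustration. It assumes that an annotation with no layer, or one attached to an OCMD rather than directly to an OCG, should be reported with a `layerName` of `None`. The location is taken from `annot.rect`, which is expressed in PyMuPDF's page coordinates and may therefore differ from the raw `/Rect` values PyPDF2 returns.

```python
import json


def annotations_by_layer(doc):
    """Return one {"layerName", "location"} dict per annotation in `doc`.

    `doc` is expected to be an open PyMuPDF document.
    """
    ocgs = doc.get_ocgs()  # {xref: {"name": ..., "on": ..., ...}}
    results = []
    for page in doc:
        for annot in page.annots():
            # get_oc() returns 0 when the annotation has no OCG/OCMD.
            # .get() also covers an OCMD xref, which is not a key of get_ocgs().
            ocg = ocgs.get(annot.get_oc())
            rect = annot.rect
            results.append({
                "layerName": ocg["name"] if ocg else None,
                "location": [rect.x0, rect.y0, rect.x1, rect.y1],
            })
    return results


def extract_from_file(path):
    """Open `path` with PyMuPDF and return its annotations as JSON text."""
    import fitz  # PyMuPDF; imported here so the helper above has no hard dependency

    with fitz.open(path) as doc:
        return json.dumps(annotations_by_layer(doc), indent=2)
```

Calling `print(extract_from_file("input.pdf"))` would then print a JSON list with one entry per annotation, which can be grouped by `layerName` afterwards if needed.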