
Retrieving XMP metadata from PDF files with Python xmptools

I would like to use Python to retrieve metadata stored in PDF files. I am trying to use Python xmptools, but I find that I cannot extract all of the metadata. For example, this paper is available in PDF format. I have the following script that tries to extract the metadata:

from xmptools import XMPMetadata, DC
xmp = XMPMetadata.fromFile("Leonard_2015_Comment_on_‘Dimensionless_units_in_the_SI’.pdf")[0]
print( xmp.getContainerItems(DC.publisher) )

This works fine. The result is [rdflib.term.Literal('IOP Publishing')]. However, if I change the last line to

print( xmp.getContainerItems(DC.identifier) )

then I get None as a result.

I think this may be due to the XML inside the PDF file. The data concerned with these two queries are

               <rdf:li>IOP Publishing</rdf:li>

In the case of publisher, the information is wrapped in RDF tags, but that is not the case for identifier.

Is there a way for xmptools to read simple entries where RDF tags have not been used?


  • pypdf can access PDF metadata. Common attributes are exposed as properties out of the box, or you can obtain the root minidom element and iterate over it yourself.

    from pypdf import PdfReader
    # PdfReader also accepts a path, so no file handle is left open
    reader = PdfReader("/home/lmc/tmp/shapes.pdf")
    meta = reader.xmp_metadata  # None if the PDF has no XMP packet



    Getting the root minidom object

    meta = reader.xmp_metadata
    root = meta.rdf_root


    <class 'xml.dom.minidom.Element'>
    <rdf:RDF xmlns:rdf="">
      <rdf:Description xmlns:pdfaid="" rdf:about="">
      <rdf:Description xmlns:dc="" rdf:about="">
       <!-- redacted -->

    Getting specific elements

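    # Look up by qualified (prefixed) tag name
    for node in root.getElementsByTagName('xmp:ModifyDate'):
        print(node.firstChild.nodeValue, node.toxml())
    # Same lookup by namespace URI and local name
    # (the first argument must be the URI bound to the xmp: prefix)
    for node in root.getElementsByTagNameNS('', 'ModifyDate'):
        print(node.firstChild.nodeValue, node.toxml())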


    2024-05-06T19:20:03-03:00 <xmp:ModifyDate>2024-05-06T19:20:03-03:00</xmp:ModifyDate>
    2024-05-06T19:20:03-03:00 <xmp:ModifyDate>2024-05-06T19:20:03-03:00</xmp:ModifyDate>

    Additionally, pyxml2xpath can list every XPath expression found in the metadata XML, so you can see which elements are present without walking the tree element by element.

    # pip install pyxml2xpath==0.3.3
    from xml2xpath import xml2xpath
    tree, ns, xmap = xml2xpath.fromstring(root.toxml())
    # get specific element
    mod_date = tree.xpath('//rdf:Description/xmp:ModifyDate', namespaces=ns)[0]
    print('ModifyDate', mod_date.text)
    # print all found elements
    xml2xpath.print_xpaths(xmap, 'all')

    Result (redacted)

    ModifyDate 2024-05-06T19:20:03-03:00
    Found  38 xpath expressions for elements
    Found   7 xpath expressions for attributes
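
For the original question (publisher stored as an RDF container, identifier stored as plain text), a minimal sketch of a lookup that handles both shapes with the standard-library minidom might look like the following. The sample XML, the identifier value, and the helper names `text_of` and `property_values` are made up for illustration; they are not part of xmptools or pypdf. With pypdf, the `rdf_root` element shown above could presumably be passed in as `root` instead of the parsed sample.

```python
from xml.dom import minidom

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical XMP fragment: publisher is an rdf:Bag, identifier is plain text.
SAMPLE_XMP = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:publisher>
      <rdf:Bag>
        <rdf:li>IOP Publishing</rdf:li>
      </rdf:Bag>
    </dc:publisher>
    <dc:identifier>doi:10.0000/example</dc:identifier>
  </rdf:Description>
</rdf:RDF>"""


def text_of(node):
    """Concatenate the direct text children of a minidom node."""
    return "".join(
        child.data for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    ).strip()


def property_values(root, namespace, local_name):
    """Return the values of a property, whether simple or an RDF container."""
    values = []
    for node in root.getElementsByTagNameNS(namespace, local_name):
        items = node.getElementsByTagNameNS(RDF_NS, "li")
        if items:
            # Container form: one value per rdf:li entry
            values.extend(text_of(item) for item in items)
        else:
            # Simple form: the text sits directly inside the element
            values.append(text_of(node))
    return values


root = minidom.parseString(SAMPLE_XMP).documentElement
print(property_values(root, DC_NS, "publisher"))   # ['IOP Publishing']
print(property_values(root, DC_NS, "identifier"))  # ['doi:10.0000/example']
```

Matching on namespace URI plus local name rather than on the `dc:` prefix keeps the lookup working even if a file binds the Dublin Core namespace to a different prefix.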