Tags: python, xml, xsd, auto-generate

Generate a full XML file with Python


I need to machine generate (not to parse!) a (possibly complex) XML file in Python.

I am (relatively) familiar with the xml and lxml modules, but it is unclear to me how to generate valid XML that is checked against an XSD spec.

What I need to build is something like:

<?xml version='1.0' encoding='utf-8'?>
<package xmlns="http://www.idpf.org/2007/opf" unique-identifier="uuid_id" version="2.0">
    <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
        <dc:identifier opf:scheme="calibre" id="calibre_id">4117</dc:identifier>
        <dc:identifier opf:scheme="uuid" id="uuid_id">d06a2234-67b4-40db-8f4a-136e52057101</dc:identifier>
        <dc:title>La Fine Di Alice</dc:title>
        <dc:creator opf:file-as="Homes, A. M." opf:role="aut">A. M. Homes</dc:creator>
        <dc:contributor opf:file-as="calibre" opf:role="bkp">calibre (3.10.0) [https://calibre-ebook.com]</dc:contributor>
        <dc:date>2005-11-15T00:00:00+00:00</dc:date>
        <dc:publisher>Minimum Fax</dc:publisher>
        <dc:identifier opf:scheme="ISBN">9788875210649</dc:identifier>
        <dc:language>en</dc:language>
        <meta content="{&quot;A. M. Homes&quot;: &quot;&quot;}" name="calibre:author_link_map"/>
        <meta content="2017-11-07T07:34:41.217796+00:00" name="calibre:timestamp"/>
        <meta content="La Fine Di Alice" name="calibre:title_sort"/>
    </metadata>
    <guide>
        <reference href="cover.jpg" title="Cover" type="cover"/>
    </guide>
</package>

The complete grammar is here.

I tried something along these lines:

from lxml import etree as ET

et = ET.Element('package', attrib={'version': "2.0", 'xmlns': "http://www.idpf.org/2007/opf", 'unique-identifier': "BookId"})
md = ET.SubElement(et, 'metadata', attrib={'xmlns:dc': "http://purl.org/dc/elements/1.1/", 'xmlns:opf': "http://www.idpf.org/2007/opf"})
au = ET.SubElement(md, 'dc:title')
au.text = bk['Title']

s = ET.tostring(et, pretty_print=True)

... but it fails miserably with: "ValueError: Invalid attribute name 'xmlns:dc'"

Any pointers are welcome.


Solution

  • Reference: http://lxml.de/tutorial.html#namespaces

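    lxml does not accept a prefix-qualified string such as 'dc:title' as a tag name, nor an 'xmlns:dc' attribute as a namespace declaration. Instead, declare the namespaces through nsmap and write each name either in {http://url.url/url}tag form or as a QName.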

    Here is your program, using namespaces:

    from lxml import etree as ET
    
    NS_DC = "http://purl.org/dc/elements/1.1/"
    NS_OPF = "http://www.idpf.org/2007/opf"
    # Prefix-to-URI map; the None key declares the default namespace.
    nsmap = {
        "dc": NS_DC,
        None: NS_OPF,
    }
    # Namespace-qualified tag names.
    PACKAGE = ET.QName(NS_OPF, 'package')
    METADATA = ET.QName(NS_OPF, 'metadata')
    TITLE = ET.QName(NS_DC, 'title')
    
    # Passing nsmap on the root emits the xmlns declarations there.
    et = ET.Element(PACKAGE,
                    attrib={'version': "2.0",
                            'unique-identifier': "BookId"},
                    nsmap=nsmap)
    
    md = ET.SubElement(et, METADATA)
    au = ET.SubElement(md, TITLE)
    au.text = "A Tale of Two Cities"
    
    # tostring() returns bytes here, so decode before printing.
    s = ET.tostring(et, pretty_print=True)
    print(s.decode('utf-8'))
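
The {http://url.url/url}tag form mentioned above can be sketched as follows. This is a minimal sketch rather than part of the original answer: clark() is a hypothetical helper name introduced here, and the ISBN value is taken from the sample in the question. It also sets a namespace-qualified attribute (opf:scheme), assuming an extra "opf" prefix is declared for that purpose.

```python
from lxml import etree as ET

NS_DC = "http://purl.org/dc/elements/1.1/"
NS_OPF = "http://www.idpf.org/2007/opf"


def clark(namespace, local_name):
    """Hypothetical helper: build a '{namespace}local_name' string."""
    return "{%s}%s" % (namespace, local_name)


# Prefixes are declared once, on the root; None is the default namespace.
# The extra "opf" prefix exists for attributes, because unprefixed
# attributes never belong to the default namespace.
nsmap = {None: NS_OPF, "dc": NS_DC, "opf": NS_OPF}

package = ET.Element(clark(NS_OPF, "package"), nsmap=nsmap)
package.set("version", "2.0")
metadata = ET.SubElement(package, clark(NS_OPF, "metadata"))

identifier = ET.SubElement(metadata, clark(NS_DC, "identifier"))
identifier.set(clark(NS_OPF, "scheme"), "ISBN")  # namespaced attribute
identifier.text = "9788875210649"

# encoding="unicode" makes tostring() return str instead of bytes.
xml_text = ET.tostring(package, pretty_print=True, encoding="unicode")
print(xml_text)
```

No prefix appears in the element-building code; the serializer picks the prefixes from nsmap. Looking elements up again with the same {namespace}tag strings is a quick way to confirm that the structure is what you intended.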
    


    Alternatively, you can use lxml.builder.ElementMaker. The following program creates a portion of your example.

    Notes:

    • Attributes are passed as dicts, which allows attribute names that are not valid Python names (for example unique-identifier).
    • Namespace-qualified attribute names are represented with QName.
    • The resulting tree is checked with validator, a RelaxNG validator loaded from opf-schema.xml.

    from lxml import etree as ET
    from lxml.builder import ElementMaker
    
    NS_DC = "http://purl.org/dc/elements/1.1/"
    NS_OPF = "http://www.idpf.org/2007/opf"
    SCHEME = ET.QName(NS_OPF, 'scheme')
    FILE_AS = ET.QName(NS_OPF, "file-as")
    ROLE = ET.QName(NS_OPF, "role")
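    # One ElementMaker per namespace; prefixes are declared through nsmap.
    opf = ElementMaker(namespace=NS_OPF, nsmap={"opf": NS_OPF, "dc": NS_DC})
    dc = ElementMaker(namespace=NS_DC)
    # Load the RelaxNG grammar; the path is relative to the working directory.
    validator = ET.RelaxNG(ET.parse("opf-schema.xml"))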
    
    tree = (
        opf.package(
            {"unique-identifier": "uuid_id", "version": "2.0"},
            opf.metadata(
                dc.identifier(
                    {SCHEME: "uuid", "id": "uuid_id"},
                    "d06a2234-67b4-40db-8f4a-136e52057101"),
                dc.creator({FILE_AS: "Homes, A. M.", ROLE: "aut"}, "A. M. Homes"),
                dc.title("My Book"),
                dc.language("en"),
            ),
            opf.manifest(
                opf.item({"id": "foo", "href": "foo.pdf", "media-type": "foo"})
            ),
            opf.spine(
                {"toc": "uuid_id"},
                opf.itemref({"idref": "uuid_id"}),
            ),
            opf.guide(
                opf.reference(
                    {"href": "cover.jpg", "title": "Cover", "type": "cover"})
            ),
        )
    )
    # Raises an exception if the tree does not conform to the schema.
    validator.assertValid(tree)
    
    print(ET.tostring(tree, pretty_print=True).decode('utf-8'))
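
To try the validation step without the OPF grammar file, here is a minimal sketch that uses a tiny inline RelaxNG schema. The schema and the book/title element names are hypothetical and exist only for illustration; they are not the OPF grammar. It uses validate(), which returns a boolean instead of raising, and reads the messages from error_log.

```python
from io import BytesIO

from lxml import etree as ET

# Hypothetical toy grammar: a <book> that contains exactly one <title>.
RNG = b"""\
<element name="book" xmlns="http://relaxng.org/ns/structure/1.0">
  <element name="title"><text/></element>
</element>
"""
validator = ET.RelaxNG(ET.parse(BytesIO(RNG)))

# A tree that matches the grammar.
good = ET.Element("book")
ET.SubElement(good, "title").text = "My Book"

# A tree that is missing the required <title>.
bad = ET.Element("book")

good_ok = validator.validate(good)
bad_ok = validator.validate(bad)
print(good_ok, bad_ok)

# After a failed validate(), error_log describes what went wrong.
for entry in validator.error_log:
    print(entry.message)
```

validate() is convenient when you want to report problems and carry on; assertValid(), as used in the program above, is the better fit when an invalid document should stop the run.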