python xml xsd lxml pyxb

How can I parse an XML document into a Python object?


I'm trying to consume an XML API, and I'd like to have some Python objects that represent the XML data. I have several XSDs and some example API responses from the documentation.

Here's one example XML response:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<serial:serialHeaderType xmlns:isan="http://www.isan.org/ISAN/isan"
                         xmlns:title="http://www.isan.org/schema/v1.11/common/title"
                         xmlns:serial="http://www.isan.org/schema/v1.21/common/serial"
                         xmlns:externalid="http://www.isan.org/schema/v1.11/common/externalid"
                         xmlns:common="http://www.isan.org/schema/v1.11/common/common"
                         xmlns:participant="http://www.isan.org/schema/v1.11/common/participant"
                         xmlns:language="http://www.isan.org/schema/v1.11/common/language"
                         xmlns:country="http://www.isan.org/schema/v1.11/common/country">
    <common:status>
        <common:DataType>SERIAL_HEADER_TYPE</common:DataType>
        <common:ISAN root="0000-0002-3B9F"/>
        <common:WorkStatus>ACTIVE</common:WorkStatus>
    </common:status>
    <serial:SerialHeaderId root="0000-0002-3B9F"/>
    <serial:MainTitles>
        <title:TitleDetail>
            <title:Title>Braquo</title:Title>
            <title:Language>
                <language:LanguageLabel>French</language:LanguageLabel>
                <language:LanguageCode>
                    <language:CodingSystem>ISO639_2</language:CodingSystem>
                    <language:ISO639_2Code>FRE</language:ISO639_2Code>
                </language:LanguageCode>
            </title:Language>
            <title:TitleKind>ORIGINAL</title:TitleKind>
        </title:TitleDetail>
    </serial:MainTitles>
    <serial:TotalEpisodes>11</serial:TotalEpisodes>
    <serial:TotalSeasons>0</serial:TotalSeasons>
    <serial:MinDuration>
        <common:TimeUnit>MIN</common:TimeUnit>
        <common:TimeValue>45</common:TimeValue>
    </serial:MinDuration>
    <serial:MaxDuration>
        <common:TimeUnit>MIN</common:TimeUnit>
        <common:TimeValue>144</common:TimeValue>
    </serial:MaxDuration>
    <serial:MinYear>2009</serial:MinYear>
    <serial:MaxYear>2009</serial:MaxYear>
    <serial:MainParticipantList>
        <participant:Participant>
            <participant:FirstName>Frédéric</participant:FirstName>
            <participant:LastName>Schoendoerffer</participant:LastName>
            <participant:RoleCode>DIR</participant:RoleCode>
        </participant:Participant>
        <participant:Participant>
            <participant:FirstName>Karole</participant:FirstName>
            <participant:LastName>Rocher</participant:LastName>
            <participant:RoleCode>ACT</participant:RoleCode>
        </participant:Participant>
    </serial:MainParticipantList>
    <serial:CompanyList>
        <common:Company>
            <common:CompanyKind>PRO</common:CompanyKind>
            <common:CompanyName>R.T.B.F.</common:CompanyName>
        </common:Company>
        <common:Company>
            <common:CompanyKind>PRO</common:CompanyKind>
            <common:CompanyName>Capa Drama</common:CompanyName>
        </common:Company>
        <common:Company>
            <common:CompanyKind>PRO</common:CompanyKind>
            <common:CompanyName>Marathon</common:CompanyName>
        </common:Company>
    </serial:CompanyList>
</serial:serialHeaderType>

I first tried ignoring the XSDs and running lxml.objectify on the XML returned by the API, but I ran into a problem with namespaces: having to refer to every child node with its explicit namespace was a real pain and doesn't make for readable code.

from lxml import objectify
obj = objectify.fromstring(response)
print(obj.MainTitles.TitleDetail)
# This will fail to find the element because you need to specify the namespace
print(obj.MainTitles['{http://www.isan.org/schema/v1.11/common/title}TitleDetail'])
# Or something like that; I couldn't get it to work, and I'd much rather use attributes and not specify the namespace

So then I tried generateDS to create some Python class definitions for me. I've lost the error messages that this attempt gave me but I couldn't get it to work. It would generate a module for each XSD that I gave it but it wouldn't parse the example XML.

I'm now trying pyxb and this seems much nicer so far. It's generating nicer definitions than generateDS (splitting them into multiple, reusable modules) but it won't parse the XML:

from models import serial
obj = serial.CreateFromDocument(response)

Traceback (most recent call last):
  ...
  File "/vagrant/isan/isan.py", line 58, in lookup
    return serial.CreateFromDocument(resp.content)
  File "/vagrant/isan/models/serial.py", line 69, in CreateFromDocument
    instance = handler.rootObject()
  File "/home/vagrant/venv/lib/python2.7/site-packages/pyxb/binding/saxer.py", line 285, in rootObject
    raise pyxb.UnrecognizedDOMRootNodeError(self.__rootObject)
UnrecognizedDOMRootNodeError: <pyxb.utils.saxdom.Element object at 0x2b53664dc850>

The unrecognised node is the <serial:serialHeaderType> node from the example. Looking at the pyxb source it seems that this error comes about "if the top-level element got processed as a DOM instance" but I don't know what this means or how to prevent it.

I've run out of steam exploring this, and I don't know what to do next.


Solution

  • UnrecognizedDOMRootNodeError indicates that PyXB could not locate the element in a namespace for which it has bindings registered. In your case it fails on the first element, which is {http://www.isan.org/schema/v1.21/common/serial}serialHeaderType.

    The schema for that namespace defines a complexType named SerialHeaderType but does not define an element with the name serialHeaderType. In fact it defines no top-level elements. So PyXB can't recognize it, and the XML does not validate.

    Either there's an additional schema for the namespace that you'll need to locate, one that provides elements, or the document you're receiving really doesn't validate. That may be because somebody is expecting an implicit mapping from a complex type to an element with that type, or because it's a fragment that would normally be found within some other element where that QName is a member element name.
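
    One way to check this for any schema file is to list what it declares directly under xs:schema. The following is a minimal sketch using only the standard library (Python 3). The helper name top_level_names and the SAMPLE_XSD string are assumptions made for illustration; SAMPLE_XSD is a hypothetical, heavily trimmed stand-in, not the real serial.xsd.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"


def top_level_names(xsd_text):
    """Return (element names, complexType names) declared directly under xs:schema."""
    root = ET.fromstring(xsd_text)
    # findall() with a plain tag only matches direct children of the root.
    elements = [e.get("name") for e in root.findall("{%s}element" % XS)]
    types = [t.get("name") for t in root.findall("{%s}complexType" % XS)]
    return elements, types


# Hypothetical stand-in for serial.xsd: one complex type, no global elements.
SAMPLE_XSD = """<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.isan.org/schema/v1.21/common/serial">
  <xs:complexType name="SerialHeaderType">
    <xs:sequence/>
  </xs:complexType>
</xs:schema>"""

elements, types = top_level_names(SAMPLE_XSD)
print("top-level elements:", elements)
print("top-level complex types:", types)
```

    An empty element list alongside a non-empty type list is the situation described here: there is a type to bind to, but no global element that a document root could match.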

    UPDATE: You can hand-craft an element in that namespace by adding the following to the generated bindings in serial.py:

    serialHeaderType = pyxb.binding.basis.element(pyxb.namespace.ExpandedName(Namespace, 'serialHeaderType'), SerialHeaderType)
    Namespace.addCategoryObject('elementBinding', serialHeaderType.name().localName(), serialHeaderType)
    

    If you do that, you won't get the UnrecognizedDOMRootNodeError but you will get an IncompleteElementContentError at:

    <common:status>
        <common:DataType>SERIAL_HEADER_TYPE</common:DataType>
        <common:ISAN root="0000-0002-3B9F"/>
        <common:WorkStatus>ACTIVE</common:WorkStatus>
    </common:status>
    

    which provides the following details:

    The containing element {http://www.isan.org/schema/v1.11/common/common}status is defined at common.xsd[243:3].
    The containing element type {http://www.isan.org/schema/v1.11/common/common}StatusType is defined at common.xsd[289:1]
    The {http://www.isan.org/schema/v1.11/common/common}StatusType automaton is not in an accepting state.
    Any accepted content has been stored in instance
    The following element and wildcard content would be accepted:
        An element {http://www.isan.org/schema/v1.11/common/common}ActiveISAN per common.xsd[316:3]
        An element {http://www.isan.org/schema/v1.11/common/common}MatchingISANs per common.xsd[317:3]
        An element {http://www.isan.org/schema/v1.11/common/common}Description per common.xsd[318:3]
    No content remains unconsumed
    

    Reviewing the schema confirms that, at a minimum, a {http://www.isan.org/schema/v1.11/common/common}Description element is missing but required.

    So it seems these documents are not meant to be validated, and PyXB is probably the wrong technology to use.
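
If schema validation is off the table, one possible fallback is a small non-validating converter that drops namespaces and exposes children as attributes, which is what the question was after. This is only a sketch under stated assumptions, not part of PyXB or lxml: it uses the standard library (Python 3), the helper names local and to_object are made up for illustration, and SAMPLE is a trimmed excerpt of the response shown in the question.

```python
import xml.etree.ElementTree as ET
from types import SimpleNamespace


def local(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on qualified names."""
    return tag.rsplit("}", 1)[-1]


def to_object(elem):
    """Convert an element to a SimpleNamespace, or to its text if it is a plain leaf."""
    children = list(elem)
    if not children and not elem.attrib:
        return (elem.text or "").strip()
    # XML attributes become Python attributes on the node.
    node = SimpleNamespace(**{local(k): v for k, v in elem.attrib.items()})
    for child in children:
        name, value = local(child.tag), to_object(child)
        if hasattr(node, name):
            # A repeated name is collected into a list.
            existing = getattr(node, name)
            if not isinstance(existing, list):
                existing = [existing]
                setattr(node, name, existing)
            existing.append(value)
        else:
            setattr(node, name, value)
    return node


# Trimmed excerpt of the example response from the question.
SAMPLE = """<serial:serialHeaderType
    xmlns:serial="http://www.isan.org/schema/v1.21/common/serial"
    xmlns:title="http://www.isan.org/schema/v1.11/common/title"
    xmlns:participant="http://www.isan.org/schema/v1.11/common/participant">
  <serial:SerialHeaderId root="0000-0002-3B9F"/>
  <serial:MainTitles>
    <title:TitleDetail>
      <title:Title>Braquo</title:Title>
      <title:TitleKind>ORIGINAL</title:TitleKind>
    </title:TitleDetail>
  </serial:MainTitles>
  <serial:TotalEpisodes>11</serial:TotalEpisodes>
  <serial:MainParticipantList>
    <participant:Participant>
      <participant:LastName>Schoendoerffer</participant:LastName>
    </participant:Participant>
    <participant:Participant>
      <participant:LastName>Rocher</participant:LastName>
    </participant:Participant>
  </serial:MainParticipantList>
</serial:serialHeaderType>"""

obj = to_object(ET.fromstring(SAMPLE))
print(obj.MainTitles.TitleDetail.Title)
print(obj.SerialHeaderId.root)
print([p.LastName for p in obj.MainParticipantList.Participant])
```

The trade-offs are significant. Every leaf value is a string, a child that happens to appear once is a single object while a repeated one is a list, text inside elements that also carry attributes or children is discarded, and two elements with the same local name in different namespaces would be merged. Nothing is checked against the XSDs, so this only suits cases where readable access matters more than validation.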