
Parsing non-standard date element using xmlschema


I have an xsd schema file, which includes the following definition:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
...
<xs:element name="CreateDate" minOccurs="0" maxOccurs="1">
    <xs:simpleType>
        <xs:restriction base="xs:string">
            <xs:minLength value="8"/>
            <xs:maxLength value="10"/>
        </xs:restriction>
    </xs:simpleType>
</xs:element>
...

And an xml file which includes the following element:

<?xml version="1.0" encoding="utf-8"?>
...
<!-- format is: YYYY/mm/dd -->
<CreateDate>2020/10/22</CreateDate>
...

I'm using xmlschema to parse the xml file like so:

schema = xmlschema.XMLSchema(schema_file)
element = schema.to_dict(xml_file, datetime_types=True)

Obviously, CreateDate is parsed to a string instead of a Date object. Questions:

  1. Is it possible to change the xsd definition, so that xmlschema automatically parses CreateDate to Date using format "YYYY/mm/dd"?
  2. If not, I guess I need to intercept the parsing using the value_hook or element_hook arguments to to_dict(), but I'm not sure how to go about it. Any suggestions?

Solution

  • I could only find the hook option on iter_decode. Here is an example that assumes a schema and an instance document with a single element:

    from datetime import datetime
    from pprint import pprint

    import xmlschema
    from elementpath import datatypes

    schema = xmlschema.XMLSchema('schema1.xsd')

    def my_element_hook(elementData, xsdElement, xsdType):
        # Parse the non-standard YYYY/mm/dd text and swap in an XSD date value.
        thisDate = datetime.strptime(elementData.text, '%Y/%m/%d')
        return xmlschema.ElementData(
            tag=elementData.tag,
            text=datatypes.Date10(thisDate.year, thisDate.month, thisDate.day),
            attributes=None,
            content=None,
        )

    # The hook is called for each decoded element before the result is yielded.
    for value in schema.iter_decode('sample1.xml', datetime_types=True, element_hook=my_element_hook):
        pprint(value)
    

    For a sample sample1.xml like

    <?xml version="1.0" encoding="utf-8"?>
    <!-- format is: YYYY/mm/dd -->
    <CreateDate>2020/10/22</CreateDate>
    

    and a schema

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    
    <xs:element name="CreateDate">
        <xs:simpleType>
            <xs:restriction base="xs:string">
                <xs:minLength value="8"/>
                <xs:maxLength value="10"/>
            </xs:restriction>
        </xs:simpleType>
    </xs:element>
    
    </xs:schema>
    

    I get e.g. Date10(2020, 10, 22).
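
The hook above assumes that the element text is always present and well formed. As a minimal sketch of the conversion step on its own, using only the standard library, the helper below returns None instead of raising when the text is missing or does not match the format. The name parse_slash_date is hypothetical, and it returns a plain datetime.date rather than the Date10 type used above.

```python
from datetime import date, datetime
from typing import Optional


def parse_slash_date(text: Optional[str], fmt: str = "%Y/%m/%d") -> Optional[date]:
    """Parse text such as '2020/10/22'; return None if it is missing or malformed."""
    if text is None:
        return None
    try:
        # strptime raises ValueError when the text does not match fmt.
        return datetime.strptime(text.strip(), fmt).date()
    except ValueError:
        return None
```

Inside a hook, the result could be checked for None before building the replacement ElementData, so that unexpected values are passed through unchanged.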

    For a more complex schema, I guess the element hook needs to return elementData unchanged for the elements you don't want to manipulate, and apply the code above only where elementData.tag == 'CreateDate', e.g.

    def my_element_hook(elementData, xsdElement, xsdType):
        if elementData.tag == 'CreateDate':
            # Only CreateDate carries the non-standard YYYY/mm/dd format.
            thisDate = datetime.strptime(elementData.text, '%Y/%m/%d')
            return xmlschema.ElementData(
                tag=elementData.tag,
                text=datatypes.Date10(thisDate.year, thisDate.month, thisDate.day),
                attributes=None,
                content=None,
            )
        else:
            # Every other element is passed through untouched.
            return elementData
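
If the hooks turn out to be awkward, another option is to leave decoding alone and convert the value afterwards. The sketch below is not part of xmlschema: convert_dates is a hypothetical helper that walks a nested dict/list structure, such as the one to_dict() returns, and parses the string stored under each listed key. It assumes that CreateDate decodes to a plain string under its element name (no attributes and no namespace prefix in the key), and it produces datetime.date values rather than Date10.

```python
from datetime import date, datetime


def convert_dates(node, keys=("CreateDate",), fmt="%Y/%m/%d"):
    """Return a copy of node with string values under `keys` parsed to dates."""
    if isinstance(node, dict):
        result = {}
        for key, value in node.items():
            if key in keys and isinstance(value, str):
                # Raises ValueError if the text does not match fmt.
                result[key] = datetime.strptime(value, fmt).date()
            else:
                result[key] = convert_dates(value, keys, fmt)
        return result
    if isinstance(node, list):
        return [convert_dates(item, keys, fmt) for item in node]
    # Scalars (strings, numbers, None) are returned unchanged.
    return node


# Hand-written stand-in for a decoded document; the element names other than
# CreateDate are made up for illustration.
decoded = {"Header": {"CreateDate": "2020/10/22", "Items": [{"CreateDate": "2021/01/05", "Id": 7}]}}
converted = convert_dates(decoded)
```

The trade-off is that the conversion happens outside the schema-driven decoding, so it depends on key names in the resulting dict instead of on the XSD element itself.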