python-3.x unicode encoding utf-8 pyxb

PyXB: generating class names in Unicode


Can somebody point me in the right direction? I'm unable to generate binding classes with PyXB when element names are non-ASCII.

The minimal reproducible example:

<?xml version="1.0" encoding="utf8"?>
<xs:schema elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Address">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Country" type="xs:string" />
        <xs:element name="Street" type="xs:string" />
        <xs:element name="Town" type="xs:string" />       
        <xs:element name="Дом" type="xs:string" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Note the line <xs:element name="Дом" type="xs:string" />, where I use Cyrillic. The file is encoded as UTF-8. However, when I try:

pyxbgen -u example.xsd -m example

I get the error:

Traceback (most recent call last):
  File "/home/sergey/anaconda3/lib/python3.5/xml/sax/expatreader.py", line 210, in feed
    self._parser.Parse(data, isFinal)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 9, column 26

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sergey/anaconda3/bin/pyxbgen", line 52, in <module>
    generator.resolveExternalSchema()
.......

which points to the Cyrillic name of the element. What am I missing?


Solution

  • The encoding name in the XML declaration should be spelled "utf-8", not "utf8". With that fixed, the schema parses:

    lilith[33]$ head -1 /tmp/cyr.xsd 
    <?xml version="1.0" encoding="utf-8"?>
    lilith[34]$ pyxbgen -u /tmp/cyr.xsd -m cyr
    WARNING:pyxb.binding.generate:Element use None.Дом renamed to emptyString
    Python for AbsentNamespace0 requires 1 modules
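
The difference between the two spellings can be reproduced without PyXB. The following is a minimal sketch, assuming CPython's standard xml.parsers.expat module (the same parser that appears in the traceback); SCHEMA_BODY and parse_error are illustrative names, not part of PyXB or of the original question.

```python
import xml.parsers.expat

# A cut-down version of the schema from the question, with the Cyrillic name.
SCHEMA_BODY = (
    '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
    '<xs:element name="\u0414\u043e\u043c" type="xs:string"/>'
    '</xs:schema>'
)


def parse_error(encoding_label):
    """Parse the schema as UTF-8 bytes that declare `encoding_label`.

    Returns None if expat accepts the document, otherwise the error text.
    """
    declaration = '<?xml version="1.0" encoding="%s"?>' % encoding_label
    data = (declaration + SCHEMA_BODY).encode("utf-8")
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)
    except (xml.parsers.expat.ExpatError, ValueError) as exc:
        # ExpatError is the usual failure; ValueError is caught as well in
        # case pyexpat rejects the declared encoding outright.
        return str(exc)
    return None


if __name__ == "__main__":
    for label in ("utf8", "utf-8"):
        print(label, "->", parse_error(label) or "parsed OK")
```

Passing bytes rather than str matters here: when given a str, pyexpat re-encodes the text itself and the encoding named in the declaration is not what decides how the input is decoded.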
    

    That PyXB generates an element named emptyString instead of one named Дом is a problem, though. PyXB was designed long before Python 3 and Unicode support, and it goes to great effort to convert text to valid Python 2 identifiers.

    Since you're using Python 3 it should be possible to bypass that conversion, but it's not quite trivial. Track issue 67 or, if there's a Cyrillic transliteration you prefer, the technique demonstrated here for Japanese might work.
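
To illustrate the two options in plain Python: the sketch below first checks that the Cyrillic name is already a legal Python 3 identifier, and then shows what a transliteration to ASCII could look like. Everything here is an assumption for illustration: TRANSLIT is a hypothetical, deliberately tiny mapping, and the helper names are made up. How to plug such a function into pyxbgen is not shown, since that depends on PyXB internals.

```python
import keyword

# "Дом", the element name from the schema, written with escapes so the
# example does not depend on the encoding of this source file.
NAME = "\u0414\u043e\u043c"

# Hypothetical, incomplete transliteration table; extend it with whatever
# Cyrillic-to-Latin scheme you prefer.
TRANSLIT = {
    "\u0414": "D",  # Д
    "\u043e": "o",  # о
    "\u043c": "m",  # м
}


def is_python3_identifier(name):
    """True if `name` can be used as-is as a Python 3 identifier."""
    return name.isidentifier() and not keyword.iskeyword(name)


def transliterate(name, table=TRANSLIT):
    """Replace each character found in `table`; leave the rest unchanged."""
    return "".join(table.get(ch, ch) for ch in name)


if __name__ == "__main__":
    print(is_python3_identifier(NAME))  # Python 3 accepts non-ASCII letters
    print(transliterate(NAME))
```

The first check is why bypassing the conversion is plausible on Python 3 at all; the second is only useful if ASCII names in the generated bindings are preferred.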