Error in Python-generated XML


I'm generating some XML with Python. The code counts the number of occurrences of each word in a corpus and maps each word to that count (to build a probability distribution).

Here's a small sample of the XML:

<?xml version="1.0" encoding="UTF-8" ?>
    <root>
        <Durapipe type="int">1</Durapipe>
        <EXPLAIN type="int">2</EXPLAIN>
        <woods type="int">2</woods>
        <hanging type="int">3</hanging>
        <hastily type="int">2</hastily>
        <key type="int" name="27p">1</key>
        <localized type="int">1</localized>
        <Schuster type="int">5</Schuster>
        <regularize type="int">1</regularize>
        ....
    </root>

Here's the Python I'm using to generate this:

from __future__ import unicode_literals

import nltk.corpus
from nltk import FreqDist
import dicttoxml

#corpus
words = [w.decode('utf-8', errors='replace') for w in nltk.corpus.reuters.words()]
fd = FreqDist(words)
afd = dict(fd)

# special key for sum
afd['__sum__']=fd.N()

xml = dicttoxml.dicttoxml(afd)

f=open('frequencies.xml', 'w')
f.write(xml)
f.close()

I later ran the XML through XStream to convert it into a Java Map. Unfortunately, XStream cannot convert it because of an error in the XML at an occurrence of the word 'key'. I can't find the error for the life of me. The XML error looks like this:

[Fatal Error] frequencies.xml:1:27582: Element type "key" must be followed by either attribute specifications, ">" or "/>".
Exception in thread "main" com.thoughtworks.xstream.io.StreamException: : Element type "key" must be followed by either attribute specifications, ">" or "/>".

So I have three questions here: What is this error? How can I fix the XML? How can I modify the Python code to generate correct XML?

Sorry for the lengthy question, but I'm inexperienced in both Python and XML. Any help you can give would be much appreciated. Thanks in advance!


Solution

  • nltk.corpus.reuters.words() returns a list containing some "words" that cannot be valid XML element names, for example the token .'" (a period, a single quote, and a double quote).

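    When dicttoxml() encounters such a key in the afd dictionary, it generates an element named "key" with a name attribute that holds the original (invalid) name. This is also why the sample contains <key type="int" name="27p">: 27p starts with a digit, so it cannot be an element name. For the token above, the generated element is: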

    <key type="int" name=".'"">1</key>
    

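    This is not well-formed XML: the raw double quote inside the attribute value ends the value early and leaves a stray " before the >, which is exactly what the parser error describes. All XML parsers should (rightly) reject it; xmllint does, and you've found that XStream does too. The cause is that dicttoxml() does not replace characters such as double quotes (") with &quot; in the attribute value. To work around this, call xml_escape() on the keys before running dicttoxml() (see the dict comprehension below):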

    from __future__ import unicode_literals
    
    import nltk.corpus
    from nltk import FreqDist
    from dicttoxml import dicttoxml, xml_escape
    
    #corpus
    words = [w.decode('utf-8', errors='replace') for w in nltk.corpus.reuters.words()]
    fd = FreqDist(words)
    afd = {xml_escape(k):v for k,v in fd.items()}
    
    # special key for sum
    afd['__sum__']=fd.N()
    
    xml = dicttoxml(afd)
    
    f=open('frequencies.xml', 'w')
    f.write(xml)
    f.close()
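
To see the problem in isolation, here is a minimal sketch that uses only the Python 3 standard library (an assumption, since the code above is Python 2 with dicttoxml). It checks that the element shown above is rejected by a parser, then shows that escaping the attribute value makes it parse. The helper names is_well_formed and frequencies_to_xml, and the entry/word layout, are hypothetical and chosen for illustration; they are not part of dicttoxml or XStream.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# The element from the answer: the attribute value contains a raw double
# quote, so the parser sees name=".'" followed by an unexpected quote.
broken = "<key type=\"int\" name=\".'\"\">1</key>"


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# quoteattr() escapes the value and adds the surrounding quotes itself.
word = ".'\""
fixed = '<key type="int" name=%s>1</key>' % quoteattr(word)


def frequencies_to_xml(freqs):
    """Hypothetical alternative: keep every word in an attribute, never in
    an element name, and let ElementTree do the escaping on output."""
    root = ET.Element("root")
    for token, count in freqs.items():
        entry = ET.SubElement(root, "entry", {"word": token})
        entry.text = str(count)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(is_well_formed(broken))
    print(is_well_formed(fixed))
    print(frequencies_to_xml({word: 1, "key": 1, "woods": 2}))
```

Storing every word as an attribute value avoids the question of which tokens are legal element names, but it changes the layout of the file, so whatever reads it on the Java side would need to be adapted to match.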