
parseString not working for me in xml.sax (Python)


I need to validate XML, but it comes in a variable (a str), not from a file.

So I figured this would be easy to do with xml.sax. But I can't get it to work for me. It works fine when parsing a file, but I get a strange error when parsing a string.

Here's my test-code:

from xml.sax import make_parser, parseString
import os

filename = os.path.join('.', 'data', 'data.xml')
xmlstr = "<note>\n<to>Mary</to>\n<from>Jane</from>\n<heading>Reminder</heading>\n<body>Go to the zoo</body>\n</note>"


def parsefile(file):
    parser = make_parser()
    parser.parse(file)


def parsestr(xmlstr):
    parser = make_parser()
    parseString(xmlstr.encode('utf-8'), parser)


try:
    parsefile(filename)
    print("%s is well-formed" % filename)
except Exception as e:
    print("%s is NOT well-formed! %s" % (filename, e))


try:
    parsestr(xmlstr)
    print("%s is well-formed" % ('xml string'))
except Exception as e:
    print("%s is NOT well-formed! %s" % ('xml string', e))

When executing the script, I get this:

./data/data.xml is well-formed
xml string is NOT well-formed! 'ExpatParser' object has no attribute 'processingInstruction'

What am I missing?


Solution

  • The second argument to parseString is supposed to be a ContentHandler, not a parser. While setting up the parse, xml.sax looks up callback methods such as processingInstruction on that object, and an ExpatParser doesn't have them, which is the AttributeError you're seeing.

    You're expected to subclass ContentHandler and then handle the SAX events as necessary. In this case, you're not actually trying to extract any information from the document, so you could use the base ContentHandler class:

    from xml.sax import parseString, SAXParseException
    from xml.sax.handler import ContentHandler
    
    xmlstr = "<note>\n<to>Mary</to>\n<from>Jane</from>\n<heading>Reminder</heading>\n<body>Go to the zoo</body>\n</note>"
    
    try:
        parseString(xmlstr, ContentHandler())
        print("document is well-formed")
    except SAXParseException as err:
        print("document is not well-formed:", err)
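
If you later want to do more than check well-formedness, the subclassing approach mentioned above could look roughly like this. It is only a sketch: the names ElementCollector and is_well_formed are made up for illustration and are not part of xml.sax. The string is encoded to UTF-8 bytes before parsing, which assumes the document does not declare a different encoding.

```python
from xml.sax import parseString, SAXParseException
from xml.sax.handler import ContentHandler


class ElementCollector(ContentHandler):
    """Hypothetical handler: records each element name as it is opened."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        # Called by the parser for every opening tag.
        self.names.append(name)


def is_well_formed(xml_text):
    """Return True if xml_text parses without a SAXParseException."""
    try:
        # The base ContentHandler ignores every event, so this only
        # checks that the document can be parsed.
        parseString(xml_text.encode("utf-8"), ContentHandler())
    except SAXParseException:
        return False
    return True


xmlstr = "<note>\n<to>Mary</to>\n<from>Jane</from>\n</note>"

collector = ElementCollector()
parseString(xmlstr.encode("utf-8"), collector)
print(collector.names)
print(is_well_formed(xmlstr))
print(is_well_formed("<note><to>Mary</note>"))  # mismatched closing tag
```

The handler object is the only thing that changes between the two uses: the base class for a plain check, a subclass when you need the events.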