Tags: python, python-3.x, xml, python-2.x, sax

Text between tag using SAX parser in Python


I want to print the text between a particular tag in an XML file using SAX.

However, some of the output consists only of spaces or a newline character.

Is there a way to just pick out the actual strings? What am I doing wrong?

See code extract and XML document below.

(I get the same effect with both Python 2 and Python 3.)

#!/usr/bin/env python3

import xml.sax

class MyHandler(xml.sax.ContentHandler):

        def startElement(self, name, attrs):
                self.tag = name

        def characters(self, content):
                if self.tag == "artist":
                        print('[%s]' % content)

if __name__=='__main__':
        parser=xml.sax.make_parser()
        Handler=MyHandler()
        parser.setContentHandler(Handler)  # replace the default ContentHandler
        parser.parse("songs.xml")

<?xml version="1.0"?>
<genre catalogue="Pop">
  <song title="No Tears Left to Cry">
    <artist>Ariana Grande</artist>
    <year>2018</year>
    <album>Sweetener</album>
  </song>
  <song title="Delicate">
    <artist>Taylor Swift</artist>
    <year>2018</year>
    <album>Reputation</album>
  </song>
  <song title="Mrs. Potato Head">
    <artist>Melanie Martinez</artist>
    <year>2015</year>
    <album>Cry Baby</album>
  </song>
</genre>

Solution

  • The value of self.tag is set to "artist" when the <artist> start tag is encountered, and it does not change until startElement() is called for the <year> start tag. The whitespace (newline and indentation) between </artist> and <year> is also reported through characters(), and at that point self.tag is still "artist", so it gets printed as well.

    One way around this is to add an endElement() method to MyHandler that resets self.tag to any value other than "artist":

    def endElement(self, name):
        self.tag = "whatever"
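Putting the pieces together, a complete handler might look like the sketch below. It goes one step beyond the fix above: a SAX parser is allowed to deliver the text of a single element in several characters() calls, so the sketch collects the chunks and joins them in endElement(). The class name ArtistHandler, the chunks and artists attributes, and the shortened inline SAMPLE document are assumptions made for illustration, not part of the original code.

```python
import xml.sax


class ArtistHandler(xml.sax.ContentHandler):
    """Collects the text of every <artist> element (illustrative sketch)."""

    def __init__(self):
        # Explicit base-class call so this also works on Python 2,
        # where ContentHandler is an old-style class.
        xml.sax.ContentHandler.__init__(self)
        self.tag = None      # name of the element currently open, if any
        self.chunks = []     # text pieces of the current <artist>
        self.artists = []    # finished artist names

    def startElement(self, name, attrs):
        self.tag = name
        if name == "artist":
            self.chunks = []

    def characters(self, content):
        # May be called more than once per element, so only accumulate here.
        if self.tag == "artist":
            self.chunks.append(content)

    def endElement(self, name):
        if name == "artist":
            self.artists.append("".join(self.chunks).strip())
        # Reset so whitespace between elements is no longer attributed
        # to the element that just closed.
        self.tag = None


# Shortened copy of songs.xml, inlined so the example is self-contained.
SAMPLE = b"""<?xml version="1.0"?>
<genre catalogue="Pop">
  <song title="No Tears Left to Cry">
    <artist>Ariana Grande</artist>
    <year>2018</year>
  </song>
  <song title="Delicate">
    <artist>Taylor Swift</artist>
    <year>2018</year>
  </song>
</genre>"""

handler = ArtistHandler()
xml.sax.parseString(SAMPLE, handler)
for artist in handler.artists:
    print('[%s]' % artist)
```

To read from the file instead, the parseString() call can be swapped for xml.sax.parse("songs.xml", handler). Resetting self.tag to None in endElement() is enough for flat elements like <artist>; an element with nested children would need a stack of open tags rather than a single name.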