
Extract two tags instead of one from an XML file

I have this code that is working correctly.

It extracts the titles of all Wikipedia articles.

import bz2
import xml.sax
import xml.sax.handler

class Handler(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()
        # None while outside <title>; a list of text chunks while inside it.
        self.__buffer = None

    def characters(self, data):
        if self.__buffer is None:
            return
        self.__buffer.append(data)

    def startElement(self, name, attrs):
        if name == 'title':
            self.__buffer = []

    def endElement(self, name):
        if self.__buffer is None:
            return
        print(repr(name), repr(''.join(self.__buffer)))
        self.__buffer = None

with bz2.open('/home/mrwiki-20210701-pages-meta-current.xml.bz2', 'r') as stream:
    xml.sax.parse(stream, Handler())

I am trying to extract the "bytes" attribute of the "text" element along with the "title". The following change will not work, because it collects the actual text and I need only the value of "bytes".

if name == 'title':
    self.__buffer = []
if name == 'text':
    self.__buffer = []

Here is a sample record...

myfile = """
<mediawiki xmlns="" xmlns:xsi="" xsi:schemaLocation="
rt-0.10/" version="0.10" xml:lang="mr">
    <generator>MediaWiki 1.37.0-wmf.11</generator>
      <namespace key="-2" case="first-letter">मिडिया</namespace>
      <namespace key="2303" case="case-sensitive">Gadget definition talk</namespace>
    <title>my_title </title>
      <text bytes="5823" xml:space="preserve"> some text


Current: my_title

Expected: my_title 5823


  • Here is how you can do it with ElementTree and iterparse():

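    import bz2
    from xml.etree import ElementTree as ET

    with bz2.open("mrwiki-20210701-pages-meta-current.xml.bz2", "r") as stream:
        for _, elem in ET.iterparse(stream):
            # Tags are namespace-qualified ("{namespace-uri}title"); compare the local name.
            local_name = elem.tag.rsplit("}", 1)[-1]
            if local_name == "title":
                print(elem.text)
            if local_name == "text":
                print(elem.get("bytes"))
            elem.clear()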

    iterparse() builds up a tree structure that will use a lot of memory. elem.clear() remedies that by removing all content from the elements once they have been processed.

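    The elements in the XML file are bound to a namespace (the default xmlns declared on the root element), so ElementTree reports each tag in the form "{namespace-uri}title". This must be taken into account when comparing tag names.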
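
To see the namespace-qualified tags concretely, the sketch below runs iterparse() on a small in-memory document instead of the real dump, and pairs each title with the "bytes" value to produce the expected "my_title 5823" output. The namespace URI `http://example.org/ns`, the page/revision nesting of the sample, and the helper name `titles_and_bytes` are assumptions made for illustration; the real dump declares its own URI on the root element.

```python
import io
from xml.etree import ElementTree as ET

# Hypothetical sample; the namespace URI is a placeholder, not the dump's real one.
SAMPLE = b"""<mediawiki xmlns="http://example.org/ns">
  <page>
    <title>my_title</title>
    <revision>
      <text bytes="5823" xml:space="preserve">some text</text>
    </revision>
  </page>
</mediawiki>"""


def titles_and_bytes(stream):
    """Yield (title, bytes) pairs, one per <text> element."""
    title = None
    for _, elem in ET.iterparse(stream):
        # elem.tag looks like "{http://example.org/ns}title".
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "title":
            title = elem.text
        elif local_name == "text":
            yield title, elem.get("bytes")
        # Drop the element's content once it has been handled.
        elem.clear()


for title, size in titles_and_bytes(io.BytesIO(SAMPLE)):
    print(title, size)
```

Remembering the most recent title and emitting it when the matching "text" element ends keeps the two values on one line without holding the whole tree in memory.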

    And here is SAX-based code that does the same.

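    import bz2
    import xml.sax
    import xml.sax.handler

    class Handler(xml.sax.handler.ContentHandler):
        def __init__(self):
            super().__init__()
            self.__buffer = None   # list of text chunks while inside <title>
            self.__buffer2 = None  # value of the "bytes" attribute of <text>

        def characters(self, data):
            # Text may arrive in several chunks; collect them only inside <title>.
            if self.__buffer is not None:
                self.__buffer.append(data)

        def startElement(self, name, attrs):
            if name == "title":
                self.__buffer = []
            if name == "text":
                # Attributes are delivered together with the start tag.
                self.__buffer2 = attrs.get("bytes")

        def endElement(self, name):
            if name == "title":
                print("".join(self.__buffer))
                self.__buffer = None
            if name == "text":
                print(self.__buffer2)

    with bz2.open("mrwiki-20210701-pages-meta-current.xml.bz2", "r") as stream:
        xml.sax.parse(stream, Handler())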

    A SAX parser consumes very little memory as it just reports events as they occur.

    By default, xml.sax.handler.feature_namespaces is false, which means that the parser does not perform namespace processing. startElement() receives the plain tag name as written in the file ("title", "text"), as if there were no namespace.
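
If namespace-aware events are wanted instead, the feature can be switched on through an explicitly created parser. The following is a minimal sketch on an in-memory sample; the namespace URI `http://example.org/ns` and the names `NSHandler` and `SAMPLE` are invented for illustration. With feature_namespaces enabled, the parser calls startElementNS() and endElementNS() with a (uri, localname) tuple instead of calling startElement() and endElement().

```python
import io
import xml.sax
import xml.sax.handler

# Hypothetical sample; the namespace URI is a placeholder.
SAMPLE = b"""<mediawiki xmlns="http://example.org/ns">
  <title>my_title</title>
  <text bytes="5823">some text</text>
</mediawiki>"""


class NSHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()
        self.seen = []

    def startElementNS(self, name, qname, attrs):
        # name is a (namespace URI, local name) tuple.
        uri, localname = name
        if localname == "text":
            # Attribute keys are tuples as well; an unprefixed
            # attribute has None as its namespace URI.
            self.seen.append((uri, attrs.get((None, "bytes"))))


parser = xml.sax.make_parser()
parser.setFeature(xml.sax.handler.feature_namespaces, True)
handler = NSHandler()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(SAMPLE))
print(handler.seen)
```

For the task in the question the default (namespaces off) is simpler; the namespace-aware mode is mainly useful when elements with the same local name from different namespaces have to be told apart.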