python xpath sax xmllint

Extract two tags instead of one from an XML file


I have this code, which works correctly.

It extracts the titles of all Wikipedia articles.

import bz2
import xml.sax
import xml.sax.handler

class Handler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.__buffer = None

    def characters(self, data):
        if self.__buffer is None:
            return
        self.__buffer.append(data)

    def startElement(self, name, attrs):
        if name == 'title':
            self.__buffer = []

    def endElement(self, name):
        if self.__buffer is None:
            return
        print(repr(name), repr(''.join(self.__buffer)))
        self.__buffer = None

with bz2.open('/home/mrwiki-20210701-pages-meta-current.xml.bz2', 'r') as stream:
    xml.sax.parse(stream, Handler())

I am trying to extract the "bytes" attribute of the "text" element along with the "title". The following change will not work, because it collects the element's text and I need only "bytes", not the actual text.

if name == 'title':
    self.__buffer = []
if name == 'text':
    self.__buffer = []

Here is a sample record...

myfile = """
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="mr">
  <siteinfo>
    <sitename>xyz</sitename>
    <dbname>mrwiki</dbname>
    <base>https://xx.wikipedia.org/wiki/xxxxxxxxxx</base>
    <generator>MediaWiki 1.37.0-wmf.11</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">मिडिया</namespace>
      <namespace key="2303" case="case-sensitive">Gadget definition talk</namespace>
    </namespaces>
  </siteinfo>
  <page>
    <title>my_title </title>
    <ns>0</ns>
    <id>1</id>
    <revision>
      <id>1857942</id>
      <parentid>1629326</parentid>
      <timestamp>2020-12-26T11:34:51Z</timestamp>
      <contributor>
        <username>andesh9822</username>
        <id>66586</id>
      </contributor>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="5823" xml:space="preserve"> some text
 </text>
      <sha1>11z9foqntwoukfd4xfjnfhpc9y33r25</sha1>
    </revision>
  </page>

"""

Current: my_title

Expected: my_title 5823


Solution

  • Here is how you can do it with ElementTree and iterparse():

    import bz2
    from xml.etree import ElementTree as ET
     
    with bz2.open("mrwiki-20210701-pages-meta-current.xml.bz2", "r") as stream:
        for _, elem in ET.iterparse(stream):
            if elem.tag == "{http://www.mediawiki.org/xml/export-0.10/}title":
                print(elem.text)
            if elem.tag == "{http://www.mediawiki.org/xml/export-0.10/}text":
                print(elem.get("bytes"))
            elem.clear()
    

    iterparse() builds the tree incrementally, so on a large dump it would end up holding the whole document in memory. elem.clear() remedies that by removing the text, attributes, and children of each element once it has been processed.

    The elements in the XML file are bound to the http://www.mediawiki.org/xml/export-0.10/ namespace. This must be taken into account.
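To get the title and the byte count on one line, as in the expected output, the two values can be remembered until the enclosing page element ends. The following is a minimal sketch under some assumptions: `iter_title_bytes` and `NS` are names made up for this illustration, the sample is a trimmed copy of the record from the question, and each page is assumed to hold a single revision (with several revisions, the last one would win).

```python
import io
from xml.etree import ElementTree as ET

# Namespace prefix in ElementTree's {uri}localname notation.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_title_bytes(stream):
    """Yield (title, bytes) for each <page> (hypothetical helper)."""
    title = None
    size = None
    # By default iterparse() reports only "end" events.
    for _, elem in ET.iterparse(stream):
        if elem.tag == NS + "title":
            title = elem.text
        elif elem.tag == NS + "text":
            size = elem.get("bytes")  # None if the attribute is missing
        elif elem.tag == NS + "page":
            yield title, size
            title = size = None
            elem.clear()  # drop the finished page's content

# Trimmed copy of the sample record, closed so that it is well-formed.
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>my_title </title>
    <revision>
      <text bytes="5823" xml:space="preserve"> some text
 </text>
    </revision>
  </page>
</mediawiki>"""

for title, size in iter_title_bytes(io.BytesIO(sample)):
    print((title or "").strip(), size)
```

For the real dump, the `io.BytesIO(sample)` argument would be replaced by the stream returned by `bz2.open()`.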


    And here is SAX-based code that does the same.

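    import bz2
    import xml.sax
    import xml.sax.handler

    class Handler(xml.sax.handler.ContentHandler):
        def __init__(self):
            super().__init__()
            self.__buffer = None   # list of text chunks while inside <title>
            self.__buffer2 = None  # value of the "bytes" attribute of <text>

        def characters(self, data):
            # The parser may deliver the text of one element in several
            # chunks, so accumulate them instead of overwriting.
            if self.__buffer is not None:
                self.__buffer.append(data)

        def startElement(self, name, attrs):
            if name == "title":
                self.__buffer = []
            if name == "text":
                # Only the attribute is needed, not the element's text.
                self.__buffer2 = attrs.get("bytes")

        def endElement(self, name):
            if name == "title":
                print("".join(self.__buffer))
                self.__buffer = None
            if name == "text":
                print(self.__buffer2)

    with bz2.open("mrwiki-20210701-pages-meta-current.xml.bz2", "r") as stream:
        xml.sax.parse(stream, Handler())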
    

    A SAX parser consumes very little memory as it just reports events as they occur.
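The same pairing of title and byte count can be done with SAX by holding both values until the page element ends. This is only a sketch: `PairHandler` and its attributes are illustrative names, the sample is a trimmed copy of the record from the question, and one revision per page is assumed.

```python
import xml.sax
import xml.sax.handler

class PairHandler(xml.sax.handler.ContentHandler):
    """Collect (title, bytes) pairs, one per <page> (hypothetical name)."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._chunks = None  # text chunks while inside <title>, else None
        self._title = None
        self._size = None

    def startElement(self, name, attrs):
        if name == "title":
            self._chunks = []
        elif name == "text":
            self._size = attrs.get("bytes")  # None if the attribute is missing

    def characters(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def endElement(self, name):
        if name == "title":
            self._title = "".join(self._chunks).strip()
            self._chunks = None
        elif name == "page":
            self.pairs.append((self._title, self._size))
            self._title = self._size = None

# Trimmed copy of the sample record, closed so that it is well-formed.
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>my_title </title>
    <revision>
      <text bytes="5823" xml:space="preserve"> some text
 </text>
    </revision>
  </page>
</mediawiki>"""

handler = PairHandler()
xml.sax.parseString(sample, handler)
for title, size in handler.pairs:
    print(title, size)
```

With the real dump, `xml.sax.parse(stream, handler)` on the `bz2.open()` stream would take the place of `parseString()`, and printing inside `endElement()` instead of appending to a list would keep memory use flat.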

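    By default, xml.sax.handler.feature_namespaces is off, so the parser does no namespace processing: startElement() and endElement() receive element names exactly as they are written in the file. Since the dump declares a default namespace and uses no prefixes, the names are plain "title" and "text", as if there were no namespace.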
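If namespace-aware events are wanted, the feature can be switched on explicitly; the parser then calls startElementNS() with a (namespace URI, local name) pair instead of startElement(). A small sketch, assuming the default expat-based parser; `NSHandler`, `MW_NS`, and the one-line sample are made up for this illustration.

```python
import io
import xml.sax
import xml.sax.handler

MW_NS = "http://www.mediawiki.org/xml/export-0.10/"

class NSHandler(xml.sax.handler.ContentHandler):
    """Record the "bytes" attribute of every namespaced <text> element."""

    def __init__(self):
        super().__init__()
        self.sizes = []

    def startElementNS(self, name, qname, attrs):
        # name is a (namespace URI, local name) pair.
        if name == (MW_NS, "text"):
            # Unprefixed attributes belong to no namespace, so the key
            # is (None, "bytes").
            self.sizes.append(attrs.get((None, "bytes")))

sample = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b'<page><title>my_title </title><revision>'
    b'<text bytes="5823" xml:space="preserve"> some text</text>'
    b'</revision></page></mediawiki>'
)

parser = xml.sax.make_parser()
parser.setFeature(xml.sax.handler.feature_namespaces, True)
handler = NSHandler()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(sample))
print(handler.sizes)
```

For this dump the plain startElement() approach is simpler; the namespace-aware form mainly matters when a file mixes elements from several namespaces.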