Python etree.iterparse: export source XML of selected element including all descendants

I am using Python 3.4 to parse multi-gigabyte Wikipedia XML dump files with etree.iterparse. Within each matched <page> element I want to test its <ns> value and, depending on that value, export the source XML of the whole <page> element and all its contents, including any elements nested within it, i.e. the XML of a whole article.

I can iterate over the <page> elements and find the ones I want, but all the available functions seem to want to read text or attribute values, whereas I simply want a UTF-8 string copy of the source file's XML code for the complete in-scope <page> element. Is this possible?

A cut-down version of the XML looks like this:

<mediawiki xmlns="" xml:lang="en">
    <title>Some Article</title>
      <text xml:space="preserve">some text</text>
      <text xml:space="preserve">blah blah</text>

The Python code that gets me as far as the <ns> value test is here:

from lxml import etree

# store namespace string for all elements (only one used in Wikipedia XML docs)
ns = {'wiki' : ''}
# the same namespace in the '{uri}' form that lxml prepends to tag names
NAMESPACE = '{' + ns['wiki'] + '}'

context = etree.iterparse('src.xml', events=('end',))
for event, elem in context:
  # at the end of parsing each <page> element
  if elem.tag == (NAMESPACE+'page') and event == 'end':
    tagNs = elem.find('wiki:ns',ns)
    if tagNs is not None:
      nsValue = tagNs.text
      if nsValue == '2':
        # export the current <page>'s XML code
        pass

In this case I'd want to extract the XML code of only the second <page> element, i.e. a string holding:

      <text xml:space="preserve">blah blah</text>

edit: minor typo and better mark-up


  • You can do this.

    >>> from lxml import etree
    >>> mediawiki = etree.iterparse('mediawiki.xml')
    >>> page_content = {}
    >>> for ev, el in mediawiki:
    ...     if el.tag=='page':
    ...         if page_content['ns']=='2':
    ...             print (page_content)
    ...         page_content = {}
    ...     else:
    ...         page_content[el.tag.replace('{}', '')] = \
    ...             el.text.strip() if el.text else None
    >>> page_content
    {'mediawiki': '', 'revision': '', 'timestamp': '2017-07-27T00:59:41Z', 'title': 'User:Wonychifans', 'page': '', 'text': 'blah blah', 'ns': '2'}

    Because the structure of the output XML is quite simple, there should be no difficulty in constructing it from the dictionary.
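As a rough illustration of building the XML from such a dictionary, here is a minimal sketch. It uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without extra dependencies, and the helper name `page_from_dict` and the list of container tags to skip are assumptions made for this example. It produces a flat `<page>` element, so nesting (for instance under `<revision>`) and attributes such as `xml:space` are lost; the result is a reconstruction, not a copy of the source text.

```python
import xml.etree.ElementTree as ET


def page_from_dict(page_content):
    """Rebuild a flat <page> element from a tag -> text mapping.

    Hypothetical helper: container tags carry no text of their own,
    so they are skipped instead of being emitted as empty children.
    """
    page = ET.Element('page')
    for tag, text in page_content.items():
        if tag in ('page', 'mediawiki', 'revision'):
            continue
        child = ET.SubElement(page, tag)
        child.text = text
    # encoding='unicode' returns a str instead of bytes
    return ET.tostring(page, encoding='unicode')


example = {'title': 'User:Wonychifans', 'ns': '2', 'text': 'blah blah'}
print(page_from_dict(example))
```

If the exact original markup matters (whitespace, attributes, nesting), a dictionary like this does not hold enough information to reproduce it.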

    Edit: The following approach requires two passes through the XML file, but it could be faster, and it does recover the required XML exactly as it appears in the source.

    First, find the line numbers on which the wanted page elements start.

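    >>> from lxml import etree
    >>> mediawiki = etree.iterparse('mediawiki.xml', events=("start", "end"))
    >>> keep = False
    >>> for ev, el in mediawiki:
    ...     # strip the '{namespace}' prefix from the tag name
    ...     tag = el.tag[1+el.tag.rfind('}'):]
    ...     if ev=='start' and tag=='page':
    ...         keep=False
    ...     # element text is only guaranteed to be complete at the 'end' event
    ...     if ev=='end' and tag=='ns' and el.text=='2':
    ...         keep=True
    ...     if ev=='end' and tag=='page' and keep:
    ...         # sourceline is the line on which the element's start tag appears
    ...         print (el.sourceline)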

    Then go through the XML again to find the complete page entries, using the starting points.

    >>> with open('mediawiki.xml') as mediawiki:
    ...     # skip the lines that precede the page's starting line
    ...     for _ in range(9):
    ...         r = next(mediawiki)
    ...     for line in mediawiki:
    ...         print (line.strip())
    ...         if '</page>' in line:
    ...             break
    <text xml:space="preserve">blah blah</text>
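The second pass above hard-codes a single starting point. A minimal sketch of how it might be generalised to several pages in one read of the file follows. The function name `extract_pages` and its parameters are made up for this example, and it assumes 1-based line numbers (as printed from `el.sourceline`), that each wanted `<page>` start tag sits on the given line, and that the closing `</page>` tag appears on a later line with no nested pages.

```python
def extract_pages(path, start_lines, end_marker='</page>'):
    """Yield the raw text of each block that starts on a line in start_lines.

    Hypothetical helper: copies lines verbatim from a wanted start line
    up to and including the first line that contains end_marker.
    """
    wanted = set(start_lines)
    block = None  # lines of the block being copied, or None between blocks
    with open(path, encoding='utf-8') as source:
        for number, line in enumerate(source, start=1):
            if block is None and number in wanted:
                block = []
            if block is not None:
                block.append(line)
                if end_marker in line:
                    yield ''.join(block)
                    block = None
```

Something like `for xml_text in extract_pages('mediawiki.xml', [10]): print(xml_text)` would then print each selected page as it appears in the file. Because it is a generator that reads line by line, it never holds more than one page in memory, which matters for dumps of this size.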