I'm using Python 3.4 to parse multi-gigabyte Wikipedia XML dump files with etree.iterparse. Within the currently matched <page> element I want to test its <ns> value; depending on that value, I then want to export the source XML of the whole <page> object and all its contents, including any elements nested within it, i.e. the XML of a whole article.
I can iterate over the <page> objects and find the ones I want, but all the available functions seem to want to read text/attribute values, whereas I simply want a UTF-8 string copy of the source file's XML code for the complete in-scope <page> object. Is this possible?
A cut-down version of the XML looks like this:
<mediawiki xmlns="" xml:lang="en">
<title>Some Article</title>
<text xml:space="preserve">some text</text>
<text xml:space="preserve">blah blah</text>
The Python code that gets me as far as the <ns> value test is here:
from lxml import etree

# store namespace string for all elements (only one used in Wikipedia XML docs)
ns = {'wiki' : ''}
# Clark-notation prefix ('{uri}') used when comparing against elem.tag
NAMESPACE = '{' + ns['wiki'] + '}'

context = etree.iterparse('src.xml', events=('end',))
for event, elem in context:
    # at end of parsing each <page>
    if elem.tag == (NAMESPACE+'page') and event == 'end':
        tagNs = elem.find('wiki:ns',ns)
        if tagNs is not None:
            nsValue = tagNs.text
            if nsValue == '2':
                # export the current <page>'s XML code
                pass
In this case I'd want to extract the XML code of only the second <page> element, i.e. a string holding:
<text xml:space="preserve">blah blah</text>
You can do this.
>>> from lxml import etree
>>> mediawiki = etree.iterparse('mediawiki.xml')
>>> page_content = {}
>>> for ev, el in mediawiki:
...     if el.tag=='page':
...         # .get avoids a KeyError for a page without an <ns> child
...         if page_content.get('ns')=='2':
...             print (page_content)
...         page_content = {}
...     else:
...         # strip the namespace prefix and store the element's text under its tag
...         page_content[el.tag.replace('{}', '')] = \
...             el.text.strip() if el.text else None
...
>>> page_content
{'mediawiki': '', 'revision': '', 'timestamp': '2017-07-27T00:59:41Z', 'title': 'User:Wonychifans', 'page': '', 'text': 'blah blah', 'ns': '2'}
Because the structure of the output XML is quite simple, there should be no difficulty in reconstructing it from the dictionary.
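The reconstruction from the dictionary could look roughly like the sketch below. It assumes the usual nesting of a dump page (title and ns directly under page, timestamp and text under revision); that layout, the helper name `page_from_dict`, and the fallback import are assumptions for illustration, not something shown in the question.

```python
try:
    from lxml import etree  # library used in this thread
except ImportError:
    # The standard library offers the same Element/SubElement/tostring calls.
    import xml.etree.ElementTree as etree


def page_from_dict(page_content):
    """Rebuild a <page> element from a flat tag -> text dictionary.

    Assumed layout (hypothetical): <title> and <ns> sit directly under
    <page>, while <timestamp> and <text> sit under <revision>.
    """
    page = etree.Element('page')
    for tag in ('title', 'ns'):
        etree.SubElement(page, tag).text = page_content.get(tag)
    revision = etree.SubElement(page, 'revision')
    for tag in ('timestamp', 'text'):
        etree.SubElement(revision, tag).text = page_content.get(tag)
    # encoding='unicode' returns a str rather than bytes
    return etree.tostring(page, encoding='unicode')


example = {
    'title': 'User:Wonychifans',
    'ns': '2',
    'timestamp': '2017-07-27T00:59:41Z',
    'text': 'blah blah',
}
xml_string = page_from_dict(example)
print(xml_string)
```

Note that a dictionary of tag names and text keeps neither attributes such as xml:space nor the namespace, so the rebuilt XML is an approximation of the source rather than a copy of it.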
Edit: The following approach requires two passes through the XML file, but it could be faster, and it does recover the required XML.
First, look for the starting lines of the pages:
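>>> from lxml import etree
>>> mediawiki = etree.iterparse('mediawiki.xml', events=("start", "end"))
>>> for ev, el in mediawiki:
...     # drop the '{namespace}' prefix from the tag
...     tag = el.tag[1+el.tag.rfind('}'):]
...     if ev=='start' and tag=='page':
...         keep=False
...     # test <ns> on its 'end' event: text is not guaranteed to be parsed at 'start'
...     if ev=='end' and tag=='ns' and el.text=='2':
...         keep=True
...     if ev=='end' and tag=='page' and keep:
...         # sourceline is the line on which this <page> element starts
...         print (el.sourceline)
...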
Then go through the XML again to find the complete page entries using the starting points.
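>>> with open('mediawiki.xml') as mediawiki:
...     # skip the 9 lines that precede the wanted page
...     for _ in range(9):
...         r = next(mediawiki)
...     # print lines up to and including the closing </page> tag
...     for line in mediawiki:
...         print (line.strip())
...         if '</page>' in line:
...             break
...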
<text xml:space="preserve">blah blah</text>
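If a re-serialised copy is acceptable instead of the exact source text, the two passes can probably be collapsed into one: `etree.tostring` serialises an element together with everything nested inside it. The following is a minimal sketch, not a tested solution for real dumps. The namespace URI, the sample document, and the helper name `pages_in_namespace` are all hypothetical, and the fallback import is only there so the sketch also runs without lxml.

```python
import io

try:
    from lxml import etree  # library used in this thread
except ImportError:
    # The standard library offers the same iterparse/tostring calls.
    import xml.etree.ElementTree as etree

# Hypothetical namespace URI; substitute the one declared in your dump.
WIKI_NS = 'http://example.org/wiki-export'
PAGE_TAG = '{%s}page' % WIKI_NS
NS_TAG = '{%s}ns' % WIKI_NS


def pages_in_namespace(source, wanted_ns='2'):
    """Yield each <page> whose <ns> text equals wanted_ns, serialised as str."""
    for _event, elem in etree.iterparse(source, events=('end',)):
        if elem.tag != PAGE_TAG:
            continue
        ns_elem = elem.find(NS_TAG)  # <ns> is assumed to be a direct child
        if ns_elem is not None and ns_elem.text == wanted_ns:
            yield etree.tostring(elem, encoding='unicode')
        elem.clear()  # drop the finished page's contents to limit memory use


# Made-up sample standing in for a dump file.
SAMPLE = b"""<mediawiki xmlns="http://example.org/wiki-export" xml:lang="en">
  <page>
    <title>Some Article</title>
    <ns>0</ns>
    <revision><text xml:space="preserve">some text</text></revision>
  </page>
  <page>
    <title>User:Wonychifans</title>
    <ns>2</ns>
    <revision><text xml:space="preserve">blah blah</text></revision>
  </page>
</mediawiki>"""

matches = list(pages_in_namespace(io.BytesIO(SAMPLE)))
for page_xml in matches:
    print(page_xml)
```

For a real dump, a file name can be passed in place of the `io.BytesIO` object. Because the output is produced by the serialiser, whitespace and namespace declarations may differ from the original text; the line-based second pass above is the option that copies the source verbatim.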