Search code examples
pythonxmlxpathlxmlmultilingual

How to use inherited xml:lang with Python?


According to the xml spec the language attribute xml:lang is inherited to children. But can I use that in any Python tools?

I found a related XPath question: Is there an easy way to get the implied xml:lang value for an element? That answer has a complex XPath:

ancestor-or-self::*[attribute::xml:lang][1]/@xml:lang

But I cannot figure out how I can use that in ElementTree or lxml. (Beautiful Soup does not support XPath at all.)

We have a collection of multi-language documents like this, and would like to collect and count <important>-elements in different languages:

import lxml.etree as ET
# import xml.etree.ElementTree as ET

data = '''
<project>
<doc xml:lang="la">
  <title>Lorem <important>ipsum</important></title>
  dolor <important>sit</important> amet.
</doc>
<doc xml:lang="sv">
  <title xml:lang="la">Consectetur <important>adipiscing</important> elit</title>
  Viterligit warj allom them som <important>thetta breff</important> see.
</doc>
</project>'''

root = ET.fromstring(data)

importants = {'la': [], 'sv': []}
for important in root.iter('important'):
    lang = important.attrib['{http://www.w3.org/XML/1998/namespace}lang']
    importants[lang].append(important.text)
print(importants)

If I put the xml:lang-attributes straight in the important-tags, it works.


Solution

  • lxml supports XPath 1.0. Replace the lang expression in the question with the following:

    lang = important.xpath("ancestor-or-self::*[@xml:lang][1]/@xml:lang")[0]
    

    Resulting output:

    {'la': ['ipsum', 'sit', 'adipiscing'], 'sv': ['thetta breff']}