According to the XML spec, the xml:lang attribute is inherited by child elements. But can I make use of that in any Python tools?
I found a related XPath question, "Is there an easy way to get the implied xml:lang value for an element?" The answer there uses a fairly complex XPath expression:
ancestor-or-self::*[attribute::xml:lang][1]/@xml:lang
But I cannot figure out how I can use that in ElementTree or lxml. (Beautiful Soup does not support XPath at all.)
We have a collection of multi-language documents like this, and would like to collect and count the <important> elements in each language:
import lxml.etree as ET
# import xml.etree.ElementTree as ET
data = '''
<project>
<doc xml:lang="la">
<title>Lorem <important>ipsum</important></title>
dolor <important>sit</important> amet.
</doc>
<doc xml:lang="sv">
<title xml:lang="la">Consectetur <important>adipiscing</important> elit</title>
Viterligit warj allom them som <important>thetta breff</important> see.
</doc>
</project>'''
root = ET.fromstring(data)
importants = {'la': [], 'sv': []}
for important in root.iter('important'):
    # Raises KeyError when xml:lang is set on an ancestor rather than on <important> itself
    lang = important.attrib['{http://www.w3.org/XML/1998/namespace}lang']
    importants[lang].append(important.text)
print(importants)
If I put the xml:lang attributes directly on the important tags, it works.
lxml supports XPath 1.0, which includes the ancestor-or-self axis. Replace the lang assignment in the question with the following:
lang = important.xpath("ancestor-or-self::*[@xml:lang][1]/@xml:lang")[0]
Resulting output:
{'la': ['ipsum', 'sit', 'adipiscing'], 'sv': ['thetta breff']}
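The question also mentions ElementTree, whose limited XPath subset has no ancestor axis, so the expression above will not work there. A minimal sketch of one possible workaround, using only the standard library, is to walk the tree top-down and pass the inherited language along. The helper name collect_by_lang and the XML_LANG constant are made up for this illustration and are not part of any library; elements with no xml:lang on themselves or any ancestor end up under the key None.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Fully qualified name of xml:lang as ElementTree stores it
XML_LANG = '{http://www.w3.org/XML/1998/namespace}lang'

data = '''
<project>
<doc xml:lang="la">
<title>Lorem <important>ipsum</important></title>
dolor <important>sit</important> amet.
</doc>
<doc xml:lang="sv">
<title xml:lang="la">Consectetur <important>adipiscing</important> elit</title>
Viterligit warj allom them som <important>thetta breff</important> see.
</doc>
</project>'''


def collect_by_lang(elem, tag, inherited=None, found=None):
    """Hypothetical helper: group the text of `tag` elements by effective xml:lang."""
    if found is None:
        found = defaultdict(list)
    # An element's own xml:lang overrides whatever it inherits from its ancestors
    lang = elem.get(XML_LANG, inherited)
    if elem.tag == tag:
        found[lang].append(elem.text)
    for child in elem:
        collect_by_lang(child, tag, lang, found)
    return found


root = ET.fromstring(data)
importants = dict(collect_by_lang(root, 'important'))
print(importants)
```

With the sample data from the question, this should print the same dictionary as the lxml version. The same function also runs under lxml.etree, since it only relies on the shared ElementTree API.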