Tags: python, xml, xml-parsing, lxml

How to get all XPaths from XML with just key names and no template URLs, with Python


I need to extract XPaths and values from an XML object. Currently I use lxml, which gives either long paths with the template (namespace) URL repeated in every step, or positional XPaths that contain only indices and no key names.

Question: how can I get XPaths with just the key names, without the template URLs? Yes, string cleanup after parsing works, but I hope to find a clean solution using lxml or a similar library.

  1. with getelementpath(): has template URLs and '\n\t\t' in empty keys.
>> [(root1.getelementpath(e), e.text) for e in root1.iter()][5:10]

[('{http://schemas.oceanehr.com/templates}language/{http://schemas.oceanehr.com/templates}terminology_id/{http://schemas.oceanehr.com/templates}value',
  'ISO_639-1'),
 ('{http://schemas.oceanehr.com/templates}language/{http://schemas.oceanehr.com/templates}code_string',
  'xx'),
 ('{http://schemas.oceanehr.com/templates}territory', '\n\t\t'),
 ('{http://schemas.oceanehr.com/templates}territory/{http://schemas.oceanehr.com/templates}terminology_id',
  '\n\t\t\t'),
 ('{http://schemas.oceanehr.com/templates}territory/{http://schemas.oceanehr.com/templates}terminology_id/{http://schemas.oceanehr.com/templates}value',
  'ISO_3166-1')]
  2. with getpath(): has no key names or URLs, and '\n\t\t' in empty keys.
>> [(root1.getpath(e), e.text) for e in root1.iter()][5:10]

[('/*/*[2]/*[1]/*', 'ISO_639-1'),
 ('/*/*[2]/*[2]', 'xx'),
 ('/*/*[3]', '\n\t\t'),
 ('/*/*[3]/*[1]', '\n\t\t\t'),
 ('/*/*[3]/*[1]/*', 'ISO_3166-1')]
  3. what I need: key names, no URLs, and None in empty keys. I believe I've seen it somewhere, but can't find it now...
[('language/terminology_id/value', 'ISO_639-1'),
('language/code_string','xx'),
('territory', None),
('territory/terminology_id', None),
('territory/terminology_id/value', 'ISO_3166-1')]

This is the XML header:

<?xml version="1.0" ?>
<Lab test results
        xmlns="http://schemas.oceanehr.com/templates"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:rm="http://schemas.openehr.org/v1"
        template_id="openEHR-EHR-COMPOSITION.t_laboratory_test_result_report.v2.1">
    <name>
        <value>Lab test results</value>
    </name>
    <language>
        <terminology_id>
            <value>ISO_639-1</value>
        </terminology_id>
        <code_string>ru</code_string>

Solution

  • I'd still use .getpath().

    You're getting * in your paths because your XML has a default namespace. By using * in place of each element name, .getpath() can produce a usable XPath without the namespace having to be bound to a prefix.
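To see the effect in isolation, here is a minimal sketch that runs .getpath() on the same tiny document with and without a default namespace. The documents, the http://example.com/ns URI, and the paths() helper are made up for illustration and are not part of the question's data.

```python
from lxml import etree

# Two hypothetical documents: identical structure, one with a default namespace.
with_ns = etree.fromstring(b'<root xmlns="http://example.com/ns"><a><b/></a></root>')
without_ns = etree.fromstring(b'<root><a><b/></a></root>')


def paths(root):
    """Return getpath() for every element under root, in document order."""
    tree = root.getroottree()
    return [tree.getpath(elem) for elem in root.iter(tag=etree.Element)]


ns_paths = paths(with_ns)        # names are replaced by '*'
plain_paths = paths(without_ns)  # names are kept

print(ns_paths)
print(plain_paths)
```

In the namespaced document every step comes out as *, while the plain document keeps root, a, and b, which is why removing the namespace from the tags changes what .getpath() prints.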

    To resolve this, first set each element's name (.tag) to its local name (the element name without prefix or namespace URI). Note that this changes the parsed tree in memory, not the file on disk.

    Also, you can create an XMLParser and set remove_blank_text to True to get rid of the entries that are only whitespace.

    Example...

    XML Input (test.xml)

    <Lab_test_results
            xmlns="http://schemas.oceanehr.com/templates"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xmlns:rm="http://schemas.openehr.org/v1"
            template_id="openEHR-EHR-COMPOSITION.t_laboratory_test_result_report.v2.1">
        <name>
            <value>Lab test results</value>
        </name>
        <language>
            <terminology_id>
                <value>ISO_639-1</value>
            </terminology_id>
        </language>
    </Lab_test_results>
    

    Python

    from lxml import etree
    from pprint import pprint
    
    parser = etree.XMLParser(remove_blank_text=True)
    
    tree = etree.parse('test.xml', parser=parser)
    
    xpaths = []
    
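    for elem in tree.iter(tag=etree.Element):  # elements only; skips comments and PIs
        # Drop the namespace: '{uri}name' becomes 'name', so getpath() emits real names.
        elem.tag = etree.QName(elem).localname
        xpaths.append((tree.getpath(elem), elem.text))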
    
    pprint(xpaths)
    

    Printed Output

    [('/Lab_test_results', None),
     ('/Lab_test_results/name', None),
     ('/Lab_test_results/name/value', 'Lab test results'),
     ('/Lab_test_results/language', None),
     ('/Lab_test_results/language/terminology_id', None),
     ('/Lab_test_results/language/terminology_id/value', 'ISO_639-1')]
    

    If you need to also collect attributes, you can make a few small changes...

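    for elem in tree.iter(tag=etree.Element):
        elem.tag = etree.QName(elem).localname
        xpath = tree.getpath(elem)
        xpaths.append((xpath, elem.text))
        for attr, value in elem.attrib.items():
            # Namespaced attributes (e.g. xsi:type) have '{uri}name' keys; keep the local name.
            xpaths.append((f"{xpath}/@{etree.QName(attr).localname}", value))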
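If the paths should be relative to the root, as in the question's desired output, and the tree should stay unmodified, one possible variation is to build each path from the element's ancestors instead of calling .getpath(). This is only a sketch: local_name_paths is a hypothetical helper, the XML is a trimmed version of the question's document, and unlike .getpath() it adds no [n] index, so repeated siblings with the same name would share one path.

```python
from lxml import etree

# Trimmed, hypothetical sample based on the question's document.
XML = b"""<Lab_test_results xmlns="http://schemas.oceanehr.com/templates">
    <language>
        <terminology_id>
            <value>ISO_639-1</value>
        </terminology_id>
        <code_string>ru</code_string>
    </language>
</Lab_test_results>"""


def local_name_paths(root):
    """Return (path, text) pairs relative to root, using local names only."""
    pairs = []
    for elem in root.iter(tag=etree.Element):  # elements only
        if elem.getparent() is None:
            continue  # skip the root itself
        # elem plus its ancestors, innermost first; [:-1] drops the root.
        chain = [elem, *elem.iterancestors()][:-1]
        names = [etree.QName(node).localname for node in reversed(chain)]
        pairs.append(("/".join(names), elem.text))
    return pairs


# remove_blank_text drops whitespace-only text, so container elements get None.
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(XML, parser)
pairs = local_name_paths(root)

for pair in pairs:
    print(pair)
```

Because the tags are never rewritten, the namespace information stays available on the tree for any later processing.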