Tags: python, xml, lxml, elementtree, xmltodict

Python: How to navigate XML sub-nodes efficiently?


I am trying to extract certain data points from XML and have tried two options...

  1. Working with XML format using ElementTree
  2. Working with Dictionary using xmltodict

Here's what I have so far:

Code

# Packages
# --------------------------------------
import xml.etree.ElementTree as ET

# XML Data
# --------------------------------------
message_xml = \
'<ClinicalDocument> \
    <code code="34133-9" displayName="Summarization of Episode Note"/> \
    <title>Care Summary</title> \
    <recordTarget> \
        <patientRole> \
            <id assigningAuthorityName="LOCAL" extension="L123456"/> \
            <id assigningAuthorityName="SSN" extension="788889999"/> \
            <id assigningAuthorityName="GLOBAL" extension="G123456"/> \
            <addr use="HP"> \
                <streetAddressLine>1000 N SOME AVENUE</streetAddressLine> \
                <city>BIG CITY</city> \
                <state>NA</state> \
                <postalCode>12345-1010</postalCode> \
                <country>US</country> \
            </addr> \
            <telecom nullFlavor="NI"/> \
            <patient> \
                <name use="L"> \
                    <given>JANE</given> \
                    <given>JOE</given> \
                    <family>DOE</family> \
                </name> \
            </patient> \
        </patientRole> \
    </recordTarget> \
</ClinicalDocument>'

# Get Tree & Root
# --------------------------------------
tree = ET.ElementTree(ET.fromstring(message_xml))
root = tree.getroot()

# Iterate
# --------------------------------------
for node in root:

    tag = node.tag
    attribute = node.attrib

    # Get ClinicalDocument.code values
    if tag == 'code':
        document_code_code = attribute.get('code')
        document_code_name = attribute.get('displayName')

    else:
        pass

    # Get ClinicalDocument.recordTarget values
    if tag == 'recordTarget':

        for child in node.iter():

            # Multiple <id> tags
            record_target_local = ??
            record_target_ssn = ??
            record_target_global = ??

            # Multiple <given> tags
            record_target_name_first = ??
            record_target_name_middle = ??
            record_target_name_last = ??

    else:
        pass

Expected Output

document_code,document_name,id_local,id_ssn,id_global,name_first,name_middle,name_last
34133-9,Summarization of Episode Note,L123456,788889999,G123456,JANE,JOE,DOE

Acceptable Output

document_code,document_name,id_type,id,name_first,name_middle,name_last
34133-9,Summarization of Episode Note,LOCAL,L123456,JANE,JOE,DOE
34133-9,Summarization of Episode Note,SSN,788889999,JANE,JOE,DOE
34133-9,Summarization of Episode Note,GLOBAL,G123456,JANE,JOE,DOE

Questions

  1. How to efficiently navigate child-nodes with multiple child-nodes under them?
  2. How to handle duplicate tags (ex: <id>, <given>)?

Solution

  • How to efficiently navigate child-nodes with multiple child-nodes under them?

    A good way to navigate XML is with XPath. ElementTree supports only a subset of XPath, but that subset looks sufficient for what you need. If you end up needing more complicated expressions, I'd suggest using xpath() in lxml.
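
    As a rough illustration of the XPath subset ElementTree accepts, the sketch below runs find(), findtext() and findall() against a trimmed copy of the XML from the question. The variable names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the document from the question (illustrative only).
xml_text = """
<ClinicalDocument>
    <code code="34133-9" displayName="Summarization of Episode Note"/>
    <title>Care Summary</title>
    <recordTarget>
        <patientRole>
            <id assigningAuthorityName="LOCAL" extension="L123456"/>
            <id assigningAuthorityName="SSN" extension="788889999"/>
            <patient>
                <name use="L">
                    <given>JANE</given>
                    <given>JOE</given>
                    <family>DOE</family>
                </name>
            </patient>
        </patientRole>
    </recordTarget>
</ClinicalDocument>
"""

root = ET.fromstring(xml_text)

# find() returns the first matching child element, or None if nothing matches.
code = root.find("code")
document_code = code.get("code")

# findtext() returns the text of the first match directly.
title = root.findtext("title")

# A slash-separated path walks down several levels in one call.
family = root.findtext("recordTarget/patientRole/patient/name/family")

# ".//" searches all descendants, so the full path does not have to be spelled out.
given_names = [given.text for given in root.findall(".//given")]
```

    Paths are relative to the element you call the method on, so the same expressions work on any sub-element, not just the root.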

  • How to handle duplicate tags (ex: <id>, <given>)?

    It depends on what you need to do with those elements. For example, if you want separate rows for each id element, you'd need to iterate over each one (with findall() in ElementTree or xpath() in lxml).

    If you just want a value (either text or an attribute value), you need to narrow it down to a single element in the XPath.

    For example, an id element that has an assigningAuthorityName attribute value equal to LOCAL would be id[@assigningAuthorityName='LOCAL'].
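
    To make the two approaches concrete, here is a small sketch that uses only the <id> elements from the question: an attribute predicate to pick out one element, and findall() to visit all of them. The names role, local_id and all_ids are illustrative.

```python
import xml.etree.ElementTree as ET

# Just the <id> elements from the question, wrapped in their parent.
role = ET.fromstring("""
<patientRole>
    <id assigningAuthorityName="LOCAL" extension="L123456"/>
    <id assigningAuthorityName="SSN" extension="788889999"/>
    <id assigningAuthorityName="GLOBAL" extension="G123456"/>
</patientRole>
""")

# One specific element: the predicate narrows the match to a single <id>.
local_id = role.find("id[@assigningAuthorityName='LOCAL']").get("extension")

# Every element: findall() returns all <id> children in document order.
all_ids = {
    element.get("assigningAuthorityName"): element.get("extension")
    for element in role.findall("id")
}
```

    Note that find() returns None when nothing matches, so the chained .get() call would raise AttributeError for a document that lacks a LOCAL id.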

    The given element is a little trickier: how can you tell which one is the first name and which is the middle name? The only way I can see is position; the first given (given[1]) is the first name and the second given (given[2]) is the middle name. Are you guaranteed to always have two given elements? If not, you may need to add checks or try/except statements to get the needed output.
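
    One possible way to cope with a missing second given element, sketched under the assumption that an empty string is an acceptable fallback, is findtext() with a default. The helper split_name is hypothetical and exists only for this example.

```python
import xml.etree.ElementTree as ET

name_full = ET.fromstring(
    "<name><given>JANE</given><given>JOE</given><family>DOE</family></name>"
)
name_short = ET.fromstring(
    "<name><given>JANE</given><family>DOE</family></name>"
)


def split_name(name):
    """Hypothetical helper: map positional <given> elements to name columns."""
    return {
        # Position predicates are 1-based: given[1] is the first <given> child.
        "name_first": name.findtext("given[1]", default=""),
        # findtext() returns the default instead of raising when given[2] is absent.
        "name_middle": name.findtext("given[2]", default=""),
        "name_last": name.findtext("family", default=""),
    }


full = split_name(name_full)
short = split_name(name_short)

# find() signals the missing element with None, which is what makes
# an unguarded .text access fail.
second_given = name_short.find("given[2]")
```

    Whether an empty middle name is the right fallback depends on your data; the point is only that the lookup no longer raises.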

    Also, since you're creating CSV output, I'd recommend using the csv module, specifically DictWriter.

    This will allow you to store the values from the XML in a dict to write rows. You can create new copies of the dict for new rows while maintaining common values (like document_code and document_name).
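
    The copy-per-row idea can be sketched in isolation, without any XML. The example writes to an in-memory buffer instead of a file so the result is easy to inspect; the hard-coded values and the names common, ids and csv_text are placeholders for illustration.

```python
import csv
import io

# Values shared by every row (placeholders standing in for parsed XML values).
common = {"document_code": "34133-9", "document_name": "Summarization of Episode Note"}
ids = ["L123456", "788889999"]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["document_code", "document_name", "id"],
    lineterminator="\n",
)
writer.writeheader()

for id_value in ids:
    # A shallow copy is enough here because the dict only holds strings.
    row = dict(common)
    row["id"] = id_value
    writer.writerow(row)

csv_text = buffer.getvalue()
```

    Because each row starts as a copy, changing row["id"] never touches the shared values in common.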

    Here's an example that will create a new row for each recordTarget.

    XML Input (input.xml)

    <ClinicalDocument> 
        <code code="34133-9" displayName="Summarization of Episode Note"/> 
        <title>Care Summary</title> 
        <recordTarget> 
            <patientRole> 
                <id assigningAuthorityName="LOCAL" extension="L123456"/> 
                <id assigningAuthorityName="SSN" extension="788889999"/> 
                <id assigningAuthorityName="GLOBAL" extension="G123456"/> 
                <addr use="HP"> 
                    <streetAddressLine>1000 N SOME AVENUE</streetAddressLine> 
                    <city>BIG CITY</city> 
                    <state>NA</state> 
                    <postalCode>12345-1010</postalCode> 
                    <country>US</country> 
                </addr> 
                <telecom nullFlavor="NI"/> 
                <patient> 
                    <name use="L"> 
                        <given>JANE</given> 
                        <given>JOE</given> 
                        <family>DOE</family> 
                    </name> 
                </patient> 
            </patientRole> 
        </recordTarget>
    </ClinicalDocument>
    

    Python

    import csv
    import xml.etree.ElementTree as ET
    from copy import deepcopy

    # Keys define the CSV column order; document-level values are filled in once below.
    values_template = {"document_code": "", "document_name": "", "id_local": "", "id_ssn": "",
                       "id_global": "", "name_first": "", "name_middle": "", "name_last": ""}

    with open("output.csv", "w", newline="") as csvfile:
        csvwriter = csv.DictWriter(csvfile, delimiter=",", quoting=csv.QUOTE_MINIMAL,
                                   fieldnames=list(values_template))
        csvwriter.writeheader()

        tree = ET.parse("input.xml")

        # Values shared by every row come from the single <code> element.
        code = tree.find("code")
        values_template["document_code"] = code.get("code")
        values_template["document_name"] = code.get("displayName")

        # One row per <recordTarget>.
        for target in tree.findall("recordTarget"):

            # Copy the template so each row starts from the shared values.
            values = deepcopy(values_template)

            # Attribute predicates select exactly one <id> each.
            values["id_local"] = target.find("patientRole/id[@assigningAuthorityName='LOCAL']").get("extension")
            values["id_ssn"] = target.find("patientRole/id[@assigningAuthorityName='SSN']").get("extension")
            values["id_global"] = target.find("patientRole/id[@assigningAuthorityName='GLOBAL']").get("extension")
            # Position predicates: given[1] is the first name, given[2] the middle name.
            values["name_first"] = target.find("patientRole/patient/name/given[1]").text
            values["name_middle"] = target.find("patientRole/patient/name/given[2]").text
            values["name_last"] = target.find("patientRole/patient/name/family").text

            csvwriter.writerow(values)
    

    CSV Output (output.csv)

    document_code,document_name,id_local,id_ssn,id_global,name_first,name_middle,name_last
    34133-9,Summarization of Episode Note,L123456,788889999,G123456,JANE,JOE,DOE
    

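    Here's another example that creates a new row for each recordTarget/patientRole/id. Note that it writes only the extension value; the id_type column from the question's acceptable output is not included.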

    Python

    import csv
    import xml.etree.ElementTree as ET
    from copy import deepcopy

    # Keys define the CSV column order; document-level values are filled in once below.
    values_template = {"document_code": "", "document_name": "", "id": "",
                       "name_first": "", "name_middle": "", "name_last": ""}

    with open("output.csv", "w", newline="") as csvfile:
        csvwriter = csv.DictWriter(csvfile, delimiter=",", quoting=csv.QUOTE_MINIMAL,
                                   fieldnames=list(values_template))
        csvwriter.writeheader()

        tree = ET.parse("input.xml")

        # Values shared by every row come from the single <code> element.
        code = tree.find("code")
        values_template["document_code"] = code.get("code")
        values_template["document_name"] = code.get("displayName")

        for target in tree.findall("recordTarget"):

            # Copy the template so each recordTarget starts from the shared values.
            values = deepcopy(values_template)

            # Name values are the same for every <id> row of this recordTarget.
            values["name_first"] = target.find("patientRole/patient/name/given[1]").text
            values["name_middle"] = target.find("patientRole/patient/name/given[2]").text
            values["name_last"] = target.find("patientRole/patient/name/family").text

            # One row per <id>: only the "id" value changes between rows.
            for role_id in target.findall("patientRole/id"):
                values["id"] = role_id.get("extension")
                csvwriter.writerow(values)
    

    CSV Output (output.csv)

    document_code,document_name,id,name_first,name_middle,name_last
    34133-9,Summarization of Episode Note,L123456,JANE,JOE,DOE
    34133-9,Summarization of Episode Note,788889999,JANE,JOE,DOE
    34133-9,Summarization of Episode Note,G123456,JANE,JOE,DOE
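
    If you also want the id_type column from the "Acceptable Output" in the question, one possible variation is to read assigningAuthorityName alongside extension. The sketch below parses the XML from a string and writes to an in-memory buffer so it runs without any files; the generator rows_per_id is a hypothetical helper, and it assumes every recordTarget contains a patientRole/patient/name element.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Relevant parts of the XML from the question.
xml_text = """
<ClinicalDocument>
    <code code="34133-9" displayName="Summarization of Episode Note"/>
    <recordTarget>
        <patientRole>
            <id assigningAuthorityName="LOCAL" extension="L123456"/>
            <id assigningAuthorityName="SSN" extension="788889999"/>
            <id assigningAuthorityName="GLOBAL" extension="G123456"/>
            <patient>
                <name use="L">
                    <given>JANE</given>
                    <given>JOE</given>
                    <family>DOE</family>
                </name>
            </patient>
        </patientRole>
    </recordTarget>
</ClinicalDocument>
"""

fieldnames = ["document_code", "document_name", "id_type", "id",
              "name_first", "name_middle", "name_last"]


def rows_per_id(root):
    """Hypothetical helper: yield one row dict per <id> element."""
    code = root.find("code")
    for target in root.findall("recordTarget"):
        # Assumption: the <name> element is always present.
        name = target.find("patientRole/patient/name")
        common = {
            "document_code": code.get("code"),
            "document_name": code.get("displayName"),
            "name_first": name.findtext("given[1]", default=""),
            "name_middle": name.findtext("given[2]", default=""),
            "name_last": name.findtext("family", default=""),
        }
        for role_id in target.findall("patientRole/id"):
            # Merge the shared values with the two per-id values.
            yield {**common,
                   "id_type": role_id.get("assigningAuthorityName"),
                   "id": role_id.get("extension")}


root = ET.fromstring(xml_text)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(rows_per_id(root))
csv_text = buffer.getvalue()
```

    To write a real file, swap the StringIO buffer for open("output.csv", "w", newline="") and load the tree with ET.parse("input.xml") as in the examples above.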