Search code examples
pythonpandasxmlelementtree

how to parse xml with ElementTree to use it in pandas


I have a rather big xml file with multiple different elements, similar to the one bellow:

<adrmsg:ADRMessage xmlns:adrmsg="http://www.eurocontrol.int/cfmu/b2b/ADRMessage"
    xmlns:gml="http://www.opengis.net/gml/3.2" gml:id="ID_197112_1650420171084_1"
    xmlns:adrext="http://www.aixm.aero/schema/5.1.1/extensions/EUR/ADR"
    xmlns:aixm="http://www.aixm.aero/schema/5.1.1"
    xmlns:xlink="http://www.w3.org/1999/xlink">
    <adrmsg:hasMember>
        <aixm:Airspace gml:id="ID_197112_1650420171084_93332">
            <gml:identifier codeSpace="urn:uuid:">3271922d-6b7a-4953-a6ff-599b17ab785e</gml:identifier>
            <aixm:timeSlice>
                <aixm:AirspaceTimeSlice gml:id="ID_197112_1650420171084_93333">
                    <gml:validTime>
                        <gml:TimePeriod gml:id="ID_197112_1650420171084_93334">
                            <gml:beginPosition>2021-10-07T00:00:00</gml:beginPosition>
                            <gml:endPosition indeterminatePosition="unknown"/>
                        </gml:TimePeriod>
                    </gml:validTime>
                    <aixm:interpretation>BASELINE</aixm:interpretation>
                    <aixm:featureLifetime>
                        <gml:TimePeriod gml:id="ID_197112_1650420171084_93335">
                            <gml:beginPosition>2021-10-07T00:00:00</gml:beginPosition>
                            <gml:endPosition indeterminatePosition="unknown"/>
                        </gml:TimePeriod>
                    </aixm:featureLifetime>
                    <aixm:type>RAS</aixm:type>
                    <aixm:designator>EDGGNFRA</aixm:designator>
                    <aixm:name>EDGG NON FRA</aixm:name>
                    <aixm:designatorICAO>NO</aixm:designatorICAO>
                    <aixm:geometryComponent>
                        <aixm:AirspaceGeometryComponent gml:id="ID_197112_1650420171084_93336">
                            <aixm:operation>BASE</aixm:operation>
                            <aixm:theAirspaceVolume>
                                <aixm:AirspaceVolume gml:id="ID_197112_1650420171084_93337">
                                    <aixm:upperLimit uom="FL">265</aixm:upperLimit>
                                    <aixm:upperLimitReference>STD</aixm:upperLimitReference>
                                    <aixm:lowerLimit uom="FL">245</aixm:lowerLimit>
                                    <aixm:lowerLimitReference>STD</aixm:lowerLimitReference>
                                    <aixm:contributorAirspace>
                                        <aixm:AirspaceVolumeDependency gml:id="ID_197112_1650420171084_93338">
                                            <aixm:dependency>HORZ_PROJECTION</aixm:dependency>
                                            <aixm:theAirspace xlink:href="urn:uuid:5831b5a2-4861-4bf5-ae99-d31413234cdb"/>
                                        </aixm:AirspaceVolumeDependency>
                                    </aixm:contributorAirspace>
                                </aixm:AirspaceVolume>
                            </aixm:theAirspaceVolume>
                        </aixm:AirspaceGeometryComponent>
                    </aixm:geometryComponent>
                    <aixm:geometryComponent>
                        <aixm:AirspaceGeometryComponent gml:id="ID_197112_1650420171084_93339">
                            <aixm:operation>UNION</aixm:operation>
                            <aixm:theAirspaceVolume>
                                <aixm:AirspaceVolume gml:id="ID_197112_1650420171084_93340">
                                    <aixm:upperLimit uom="FL">255</aixm:upperLimit>
                                    <aixm:upperLimitReference>STD</aixm:upperLimitReference>
                                    <aixm:lowerLimit uom="FL">245</aixm:lowerLimit>
                                    <aixm:lowerLimitReference>STD</aixm:lowerLimitReference>
                                    <aixm:contributorAirspace>
                                        <aixm:AirspaceVolumeDependency gml:id="ID_197112_1650420171084_93341">
                                            <aixm:dependency>HORZ_PROJECTION</aixm:dependency>
                                            <aixm:theAirspace xlink:href="urn:uuid:dcd8301c-de12-4e6c-992f-fd8de781ab58"/>
                                        </aixm:AirspaceVolumeDependency>
                                    </aixm:contributorAirspace>
                                </aixm:AirspaceVolume>
                            </aixm:theAirspaceVolume>
                        </aixm:AirspaceGeometryComponent>
                    </aixm:geometryComponent>
                    <aixm:extension>
                        <adrext:AirspaceExtension gml:id="ID_197112_1650420171084_93342">
                            <adrext:usage>OPERATIONAL</adrext:usage>
                        </adrext:AirspaceExtension>
                    </aixm:extension>
                </aixm:AirspaceTimeSlice>
            </aixm:timeSlice>
        </aixm:Airspace>
    </adrmsg:hasMember>
.... many other <adrmsg:hasMember>
</adrmsg:ADRMessage>

I only added one of those elements + the namespaces .

My attempt of code :

import xml.etree.ElementTree as ET
import pandas as pd

ab = {"adrmsg":"http://www.eurocontrol.int/cfmu/b2b/ADRMessage",
    "gml":"http://www.opengis.net/gml/3.2",
    "adrext":"http://www.aixm.aero/schema/5.1.1/extensions/EUR/ADR",
    "aixm": "http://www.aixm.aero/schema/5.1.1",
    "xlink":"http://www.w3.org/1999/xlink",
    "id":"http://www.opengis.net/gml/3.2",
    "href":"http://www.w3.org/1999/xlink"
}

root_node = ET.parse('Airspace.xml').getroot()

pipare = []
verate = []
for tag in root_node.findall(".//aixm:Airspace" , ab):
    value = tag.find("gml:identifier", ab)
    for char in tag.findall(".//aixm:AirspaceTimeSlice", ab):
        for per in char.findall(".//aixm:type",ab):
            for ir in char.findall(".//aixm:name",ab):
                for epa in char.findall(".//aixm:designator", ab):
                    for op in char.findall(".//aixm:theAirspace[@xlink:href]", ab):
                        pipare = [value.text, char.attrib,per.text,ir.text,epa.text,op.attrib]
                        verate.append(pipare)
                        
                       
xml_todf = pd.DataFrame(verate, columns=['uuid','id','type','name','designator','contributorAirspace'])

As you could probably see, I am in a very 'rough' way trying to parse that XML, extract the elements that I am interested in and finally put them into a pandas DataFrame.

When I am 'capturing' the .text the data extracted is what I want, but when it comes about capturing the attributes, the result is not only the values but also the namespaces...I dont know what to do to solve this. Let me share how the pandas DataFrame displays that data:

uuid id type name designator contributorAirspace
3271922d-6b7a-4953-a6ff-599b17ab785e {'{http://www.opengis.net/gml/3.2}id': 'ID_197112_1650420171084_93333'} RAS EDGG NON FRA EDGGNFRA {'{http://www.w3.org/1999/xlink}href': 'urn:uuid:5831b5a2-4861-4bf5-ae99-d31413234cdb'}
3271922d-6b7a-4953-a6ff-599b17ab785e {'{http://www.opengis.net/gml/3.2}id': 'ID_197112_1650420171084_93333'} RAS EDGG NON FRA EDGGNFRA {'{http://www.w3.org/1999/xlink}href': 'urn:uuid:dcd8301c-de12-4e6c-992f-fd8de781ab58'}

I would like to have ideally something like this:

uuid id type name designator contributorAirspace
3271922d-6b7a-4953-a6ff-599b17ab785e 'ID_197112_1650420171084_93333'} RAS EDGG NON FRA EDGGNFRA 5831b5a2-4861-4bf5-ae99-d31413234cdb , dcd8301c-de12-4e6c-992f-fd8de781ab58

but I would be very grateful if somebody could help me reach this point:

uuid id type name designator contributorAirspace
3271922d-6b7a-4953-a6ff-599b17ab785e 'ID_197112_1650420171084_93333'} RAS EDGG NON FRA EDGGNFRA 5831b5a2-4861-4bf5-ae99-d31413234cdb
3271922d-6b7a-4953-a6ff-599b17ab785e 'ID_197112_1650420171084_93333'} RAS EDGG NON FRA EDGGNFRA dcd8301c-de12-4e6c-992f-fd8de781ab58

Thanks for your help


Solution

  • Python elementtree requires to address attribute with the namespace by its qualified name (i.e. namespace + attribute name). When referring char.attrib or op.attrib a dictionary containing all element attributes with their values is retrieved. Here is an example of attribute value retrieval:

    import xml.etree.ElementTree as ET
    import pandas as pd
    from collections import defaultdict
    
    ab = {"adrmsg":"http://www.eurocontrol.int/cfmu/b2b/ADRMessage",
        "gml":"http://www.opengis.net/gml/3.2",
        "adrext":"http://www.aixm.aero/schema/5.1.1/extensions/EUR/ADR",
        "aixm": "http://www.aixm.aero/schema/5.1.1",
        "xlink":"http://www.w3.org/1999/xlink",
        "id":"http://www.opengis.net/gml/3.2",
        "href":"http://www.w3.org/1999/xlink"
    }
    
    # parse XML
    root_node = ET.fromstring(xml)
    
    # create dictionary to store parsed data
    data = defaultdict(list)
    
    for tag in root_node.findall(".//aixm:Airspace" , ab):
        value = tag.find("gml:identifier", ab)
        for char in tag.findall(".//aixm:AirspaceTimeSlice", ab):
            for per in char.findall(".//aixm:type",ab):
                for ir in char.findall(".//aixm:name",ab):
                    for epa in char.findall(".//aixm:designator", ab):
                        for op in char.findall(".//aixm:theAirspace[@xlink:href]", ab):
                            data['uuid'].append(value.text)
                            data['id'].append(char.attrib['{http://www.opengis.net/gml/3.2}id'])
                            data['type'].append(per.text)
                            data['name'].append(ir.text)
                            data['designator'].append(epa.text)
                            #data['contributorAirspace'].append(op.attrib['{http://www.w3.org/1999/xlink}href'])
                            
                           
    df = pd.DataFrame(data)
    

    Note the expressions char.attrib['{http://www.opengis.net/gml/3.2}id'] and op.attrib['{http://www.w3.org/1999/xlink}href']. They address attrbutes using qualified names and retrieve attrbute values.

    Also this example uses defaultdict instead of two lists, but that's matter of taste.