Tags: python, xml, elementtree, jmespath

How to extract information from multiple XML nodes and hierarchies using python?


I have the following structure:

<population>
    <person id="101">
        <attributes>
            <attribute name="age" class="java.lang.Integer" >53</attribute>
        </attributes>
        <plan score="-0.38" selected="yes">
            <activity type="outside" link="81312" facility="outside_208" x="649324.9906891582" y="6866581.699995641" end_time="08:22:00" >
            </activity>
            <leg mode="car" dep_time="08:22:00" trav_time="00:10:13">
                <route type="links" start_link="81312" end_link="138852" trav_time="00:10:13" distance="6046.54932060571" vehicleRefId="7262234">81312</route>
            </leg>
            <activity type="work" link="138852" facility="38407" x="651680.6" y="6863892.5" start_time="08:45:22" end_time="17:15:22" >
                <attributes>
                    <attribute name="innerParis" class="java.lang.Boolean" >true</attribute>
                </attributes>
            </activity>
            <leg mode="car" dep_time="17:15:22" trav_time="00:07:05">
                <route type="links" start_link="138852" end_link="189898" trav_time="00:07:05" distance="4604.544053407517" vehicleRefId="7262234">138852</route>
            </leg>
            <activity type="outside" link="189898" facility="outside_249" x="648729.9598002436" y="6866057.250182923" end_time="17:20:35" >
            </activity>
        </plan>
        <plan score="-0.38" selected="no">
            <activity type="inside" link="81312" facility="outside_208" x="649324.9906891582" y="6866581.699995641" end_time="08:22:00" >
            </activity>
            <leg mode="bike" dep_time="08:22:00" trav_time="00:10:13">
                <route type="links" start_link="81312" end_link="138852" trav_time="00:10:13" distance="6046.54932060571" vehicleRefId="7262234">81312</route>
            </leg>
            <activity type="shopping" link="138852" facility="38407" x="651680.6" y="6863892.5" start_time="08:45:22" end_time="17:15:22" >
                <attributes>
                    <attribute name="innerParis" class="java.lang.Boolean" >true</attribute>
                </attributes>
            </activity>
            <leg mode="bike" dep_time="08:22:00" trav_time="00:10:13">
                <route type="links" start_link="81312" end_link="138852" trav_time="00:10:13" distance="6046.54932060571" vehicleRefId="7262234">81312</route>
            </leg>
            <activity type="work" link="138852" facility="38407" x="651680.6" y="6863892.5" start_time="08:45:22" end_time="17:15:22" >
                <attributes>
                    <attribute name="innerParis" class="java.lang.Boolean" >true</attribute>
                </attributes>
            </activity>
            <leg mode="pt" dep_time="17:15:22" trav_time="00:07:05">
                <route type="links" start_link="138852" end_link="189898" trav_time="00:07:05" distance="4604.544053407517" vehicleRefId="7262234">138852</route>
            </leg>
            <activity type="outside" link="189898" facility="outside_249" x="648729.9598002436" y="6866057.250182923" end_time="17:20:35" >
            </activity>
        </plan>
    </person>
    <person id="102">
        <attributes>
            <attribute name="age" class="java.lang.Integer" >53</attribute>
        </attributes>
        <plan score="-0.38" selected="yes">
            <activity type="inside" link="81312" facility="outside_208" x="649324.9906891582" y="6866581.699995641" end_time="08:22:00" >
            </activity>
            <leg mode="bike" dep_time="08:22:00" trav_time="00:10:13">
                <route type="links" start_link="81312" end_link="138852" trav_time="00:10:13" distance="6046.54932060571" vehicleRefId="7262234">81312</route>
            </leg>
            <activity type="work" link="138852" facility="38407" x="651680.6" y="6863892.5" start_time="08:45:22" end_time="17:15:22" >
                <attributes>
                    <attribute name="innerParis" class="java.lang.Boolean" >true</attribute>
                </attributes>
            </activity>
            <leg mode="bike" dep_time="08:22:00" trav_time="00:10:13">
                <route type="links" start_link="81312" end_link="138852" trav_time="00:10:13" distance="6046.54932060571" vehicleRefId="7262234">81312</route>
            </leg>
            <activity type="work" link="138852" facility="38407" x="651680.6" y="6863892.5" start_time="08:45:22" end_time="17:15:22" >
                <attributes>
                    <attribute name="innerParis" class="java.lang.Boolean" >true</attribute>
                </attributes>
            </activity>
            <leg mode="pt" dep_time="17:15:22" trav_time="00:07:05">
                <route type="links" start_link="138852" end_link="189898" trav_time="00:07:05" distance="4604.544053407517" vehicleRefId="7262234">138852</route>
            </leg>
            <activity type="outside" link="189898" facility="outside_249" x="648729.9598002436" y="6866057.250182923" end_time="17:20:35" >
            </activity>
        </plan>
    </person>
    <person id="103">
        <attributes>
            <attribute name="age" class="java.lang.Integer" >53</attribute>
        </attributes>
        <plan score="-0.38" selected="yes">
            <activity type="inside" link="81312" facility="outside_208" x="649324.9906891582" y="6866581.699995641" end_time="08:22:00" >
            </activity>
            <leg mode="bike" dep_time="08:22:00" trav_time="00:10:13">
                <route type="links" start_link="81312" end_link="138852" trav_time="00:10:13" distance="6046.54932060571" vehicleRefId="7262234">81312</route>
            </leg>
            <activity type="shopping" link="138852" facility="38407" x="651680.6" y="6863892.5" start_time="08:45:22" end_time="17:15:22" >
                <attributes>
                    <attribute name="innerParis" class="java.lang.Boolean" >true</attribute>
                </attributes>
            </activity>
            <leg mode="bike" dep_time="08:22:00" trav_time="00:10:13">
                <route type="links" start_link="81312" end_link="138852" trav_time="00:10:13" distance="6046.54932060571" vehicleRefId="7262234">81312</route>
            </leg>
            <activity type="work" link="138852" facility="38407" x="651680.6" y="6863892.5" start_time="08:45:22" end_time="17:15:22" >
                <attributes>
                    <attribute name="innerParis" class="java.lang.Boolean" >true</attribute>
                </attributes>
            </activity>
            <leg mode="pt" dep_time="17:15:22" trav_time="00:07:05">
                <route type="links" start_link="138852" end_link="189898" trav_time="00:07:05" distance="4604.544053407517" vehicleRefId="7262234">138852</route>
            </leg>
            <activity type="outside" link="189898" facility="outside_249" x="648729.9598002436" y="6866057.250182923" end_time="17:20:35" >
            </activity>
        </plan>
    </person>
</population>

I want to extract the id of each person and, for the plan with selected="yes", all activity types and leg modes. They should be stored in their existing order, for example in a dictionary (or a data frame, it doesn't really matter).

So the ideal outcome would look like this:

id    leg_activity
101   outside; car; work; car; outside
102   inside; bike; work; bike; work ...
...

So far I've only worked with JMESPath, and I know it is not the most suitable tool, so I'm happy to see other approaches with ElementTree or similar :) Also, I was unable to find a way to extract activity and leg information in one step. This is my approach so far:

import gzip
import xmltodict
import pandas as pd
import jmespath

# gzipfile is the path to the gzip-compressed XML file
with gzip.open(gzipfile, 'rb') as f:
    box = xmltodict.parse(f)

expression = jmespath.compile('population.person[].plan[?"@selected"==`yes`].activity[*].["@type"]')

coords = expression.search(box)
coords = pd.DataFrame.from_dict(coords)

Solution

  • Assuming your XML is saved as test.xml, the following should work:

    from bs4 import BeautifulSoup
    import pandas as pd

    # The 'lxml' feature requires the lxml package to be installed
    with open('test.xml') as f:
        soup = BeautifulSoup(f, features='lxml')

    plan_log = []
    for person in soup.find_all('person'):
        log = {'id': person.get('id')}
        activities = []
        # Only look at the plan(s) marked selected="yes"
        for plan in person.find_all('plan', attrs={'selected': 'yes'}):
            # Direct children keep the document order of activities and legs
            for detail in plan.children:
                if detail.name == 'activity':
                    activities.append(detail.get('type'))
                elif detail.name == 'leg':
                    activities.append(detail.get('mode'))
                # activities.append(detail.get('type') or detail.get('mode'))
        log['leg_activity'] = ', '.join(activities)
        plan_log.append(log)

    df = pd.DataFrame(plan_log)
    print(df)
    

    Output:

        id                                     leg_activity
    0  101                 outside, car, work, car, outside
    1  102      inside, bike, work, bike, work, pt, outside
    2  103  inside, bike, shopping, bike, work, pt, outside
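
Since the question also asks about ElementTree, here is a minimal sketch of the same idea using only the standard library. The `SAMPLE` string is a shortened stand-in for the real file (an assumption: it keeps the same element and attribute names but drops most attributes), and `selected_plan_sequences` is a hypothetical helper name, not something from the original answer.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the question's file (assumption: same element
# and attribute names, most attributes omitted for brevity).
SAMPLE = """
<population>
    <person id="101">
        <plan score="-0.38" selected="yes">
            <activity type="outside" link="81312" />
            <leg mode="car" dep_time="08:22:00" />
            <activity type="work" link="138852" />
        </plan>
        <plan score="-0.38" selected="no">
            <activity type="inside" link="81312" />
            <leg mode="bike" dep_time="08:22:00" />
        </plan>
    </person>
    <person id="102">
        <plan score="-0.38" selected="yes">
            <activity type="inside" link="81312" />
            <leg mode="bike" dep_time="08:22:00" />
            <activity type="work" link="138852" />
        </plan>
    </person>
</population>
"""


def selected_plan_sequences(root):
    """Map each person id to the ordered activity types and leg modes
    found in that person's selected plan(s)."""
    result = {}
    for person in root.iter("person"):
        steps = []
        # ElementTree's limited XPath supports attribute predicates
        for plan in person.findall("plan[@selected='yes']"):
            # Iterating an element yields its direct children in document order
            for child in plan:
                if child.tag == "activity":
                    steps.append(child.get("type"))
                elif child.tag == "leg":
                    steps.append(child.get("mode"))
        result[person.get("id")] = "; ".join(steps)
    return result


root = ET.fromstring(SAMPLE)
sequences = selected_plan_sequences(root)
for person_id, sequence in sequences.items():
    print(person_id, sequence)
```

For the gzipped file from the question, `ET.parse` accepts an open file object, so something like `ET.parse(gzip.open(path, 'rb')).getroot()` should give the root element to pass in (`path` is a placeholder here). The resulting dictionary can be turned into a data frame if needed, for example with `pd.DataFrame(list(sequences.items()), columns=['id', 'leg_activity'])`.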