XML, ElementTree - Extract attributes and match them to an ID


Hello everyone and greetings from Germany!

I'm rather new to Python and I have a question concerning XML files. My data looks something like this (there are a lot of elements in this file, each with a unique way ID):

    <way id="4260867" visible="true" version="12" changeset="71461925" timestamp="2019-06-20T21:42:40Z" user="L___I" uid="7649834">
      <nd ref="25550395"/>
      <nd ref="25550396"/>
      <tag k="bicycle" v="no"/>
      <tag k="bridge" v="yes"/>
      <tag k="foot" v="no"/>
      <tag k="hazmat" v="designated"/>
      <tag k="highway" v="motorway_link"/>
      <tag k="maxspeed" v="none"/>
      <tag k="motorcar" v="yes"/>
      <tag k="oneway" v="yes"/>
      <tag k="placement" v="middle_of:1"/>
      <tag k="source:maxspeed" v="DE:motorway"/>
    </way>
    <way id="312407268" visible="true" version="9" changeset="116383142" timestamp="2022-01-20T12:11:26Z" user="m_p_13" uid="2465271">
      <nd ref="7792523927"/>
      <nd ref="25393142"/>
      <nd ref="5583629192"/>
      <nd ref="25393143"/>
      <tag k="bdouble" v="yes"/>
      <tag k="bicycle" v="no"/>
      <tag k="foot" v="yes"/>
      <tag k="highway" v="secondary"/>
      <tag k="horse" v="yes"/>
      <tag k="lanes" v="2"/>
      <tag k="maxspeed" v="60"/>
      <tag k="motorcar" v="yes"/>
      <tag k="name" v="Messe-Allee"/>
      <tag k="name:etymology:wikidata" v="Q57305"/>
      <tag k="oneway" v="yes"/>
      <tag k="ref" v="K 6529"/>
      <tag k="shoulder" v="no"/>
      <tag k="surface" v="asphalt"/>
    </way>
    <way id="106141287" visible="true" version="3" changeset="101880267" timestamp="2021-03-28T16:10:05Z" user="user_2954791" uid="2954791">
      <nd ref="913936737"/>
      <nd ref="1222080363"/>
      <tag k="bicycle" v="designated"/>
      <tag k="cycleway" v="crossing"/>
      <tag k="smoothness" v="intermediate"/>
      <tag k="surface" v="paving_stones"/>
      <tag k="traffic_sign" v="DE:241"/>
    </way>

What I want to do is extract every way ID and match it with its "nd ref" attributes (node_ids; their number differs from way to way) and, if the way has one, its "maxspeed" value.

So in the end it should look something like this:

(id, node_ids, maxspeed)
(4260867, (25550395,25550396), None)
(106141287, (913936737, 1222080363), NaN)

I started to work with ElementTree and was able to extract the IDs. I can also print out all attributes of the tag elements via

    for way in root.findall('way'):
        for i in way.findall('tag'):
            print(i.attrib)

But I'm not able to get those values into the form that I want.

I hope I can get some help! Also, if someone has a better way to organize the data than tuples, I would appreciate that! I don't know if it is important or not, but I work with PyCharm.

Thank you in advance!


Solution

  • If I understand you correctly, you are probably looking for something like the code below. I chose to run it through pandas, just to demonstrate the structure, but you can store the rows in any other structure you prefer.

    import xml.etree.ElementTree as ET
    import pandas as pd
    
    ways = """[your xml above, wrapped in a root element]"""
    doc = ET.fromstring(ways)
    targets = doc.findall('.//way')
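    rows = []
    cols = ["id", "node_ids", "maxspeed"]
    for target in targets:
        way_id = target.attrib['id']  # avoid shadowing the built-in id()
        # collect the ref of every <nd> child of this way
        nds = [nd.attrib['ref'] for nd in target.findall('.//nd')]
        # look up the maxspeed tag once; it is missing on some ways
        ms_tag = target.find(".//tag[@k='maxspeed']")
        ms = ms_tag.attrib['v'] if ms_tag is not None else None
        rows.append([way_id, nds, ms])
    df = pd.DataFrame(rows, columns=cols)
    df  # displays the frame in a notebook; use print(df) in a script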
    

    Output:

               id                                      node_ids maxspeed
    0     4260867                          [25550395, 25550396]     none
    1   312407268  [7792523927, 25393142, 5583629192, 25393143]       60
    2   106141287                       [913936737, 1222080363]     None
    

    Note: this would be somewhat simpler if you use lxml instead of ElementTree.
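
If you would rather avoid pandas and stay close to the tuple layout from the question, the same extraction can be done with the standard library alone. The following is only a sketch: the `<osm>` root element, the shortened sample data, the `Way` record type and the `parse_ways` helper are assumptions made for illustration, not part of your file or of any library. A `NamedTuple` behaves like a plain tuple but lets you access fields by name, and a dictionary keyed by ID makes lookups by way ID easy.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple, Optional, Tuple

# Shortened sample; the <osm> root element is an assumption so that it parses.
SAMPLE = """<osm>
  <way id="4260867">
    <nd ref="25550395"/>
    <nd ref="25550396"/>
    <tag k="highway" v="motorway_link"/>
    <tag k="maxspeed" v="none"/>
  </way>
  <way id="106141287">
    <nd ref="913936737"/>
    <nd ref="1222080363"/>
    <tag k="surface" v="paving_stones"/>
  </way>
</osm>"""


class Way(NamedTuple):
    """Hypothetical record type: one entry per <way> element."""
    id: int
    node_ids: Tuple[int, ...]
    maxspeed: Optional[str]  # None when the way has no maxspeed tag


def parse_ways(xml_text: str) -> List[Way]:
    """Collect id, node refs and (optional) maxspeed for every <way>."""
    root = ET.fromstring(xml_text)
    ways = []
    for way in root.iter("way"):
        node_ids = tuple(int(nd.get("ref")) for nd in way.findall("nd"))
        tag = way.find("tag[@k='maxspeed']")
        maxspeed = tag.get("v") if tag is not None else None
        ways.append(Way(int(way.get("id")), node_ids, maxspeed))
    return ways


ways = parse_ways(SAMPLE)
by_id = {w.id: w for w in ways}  # look up a way by its ID

for w in ways:
    print(tuple(w))
```

Note that the maxspeed value is kept as a string, because the data contains non-numeric values such as "none"; a way without a maxspeed tag yields Python's `None` instead. For a file on disk you would use `ET.parse(path).getroot()` in place of `ET.fromstring`.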