Search code examples
pythonxmlparsingtagsflatten

Apply outer tag attribute to inner tag - Flattening xml in python


I have a xml file like:

<plays format="tokens">
    <period number="1">
      <play/>
      <play/>
      <play/>
    </period>
    <period number="2">
      <play/>
      <play/>
      <play/>
    </period>

Each play tag contains a bunch of variables, but I would also like to add the period number as a variable to the play tags. My goal is to produce a table with each play and their attributes as well as a column that says which period that played occurred in (1 or 2).

My current code to flatten the plays out is:

d = []
for play in root.iter('play'):
    d.append(play.attrib)
    
df = pd.DataFrame(d)

This gives me every play and their attributes in the table df, but the period is not currently included in this table. Any direction would help, thank you!


Solution

  • You can do it this way with ElementTree like below-

    plays.xml

    <plays format="tokens">
        <period number="1">
          <play attr1="abc" attr2="ddd"/>
          <play attr1="cbc" attr2="ddd"/>
          <play attr1="dbc" attr2="ddd"/>
        </period>
        <period number="2">
          <play attr1="abc" attr2="ddd"/>
          <play attr1="dbc" attr2="ddd"/>
          <play attr1="kbc" attr2="ddd" />
        </period>
    </plays>
    

    main.py

    import xml.etree.ElementTree as ET
    import pandas as pd
    
    tree = ET.parse('plays.xml')
    root = tree.getroot()
    
    # find the period number for each play by searching for the parent period element
    periods = []
    for period in root.iter('period'):
        number = period.attrib['number']
        for play in period.iter('play'):
            other_attr = play.attrib
            # this line merges the other_attributes of play element(attr1, attr2) with the top attribute(number) of period element, see reference: https://stackoverflow.com/a/62820532/1138192 
            periods.append({**{"number": number}, **other_attr})
    
    df = pd.DataFrame(periods)
    print(df)
    

    Output:

      number attr1 attr2
    0      1   abc   ddd
    1      1   cbc   ddd
    2      1   dbc   ddd
    3      2   abc   ddd
    4      2   dbc   ddd
    5      2   kbc   ddd