Search code examples
pythonpandasxmlexport-to-csv

pandas & xml - How to show text of tags that are differently nested?


I'm looking for some insights on how to properly display this xml:

<?xml version="1.0" encoding="UTF-8"?>
<HEADER>
    <PRODUCT>
        <SUPPLIER>015</SUPPLIER>
        <PRODUCT_DETAILS>
            <KEYWORD>Paper</KEYWORD>
            <PRODUCT_TYPE>major</PRODUCT_TYPE>
        </PRODUCT_DETAILS>
        <PRODUCT_FEATURES>
            <REFERENCE>Class01</REFERENCE>
            <FEATURE>
                <FNAME>Colour</FNAME>
                <FVALUE>white</FVALUE>
            </FEATURE>
        </PRODUCT_FEATURES>
    </PRODUCT>
</HEADER>

For a simpler structure, that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<HEADER>
    <PRODUCT_DETAILS>
        <KEYWORD>Paper</KEYWORD>
        <PRODUCT_TYPE>major</PRODUCT_TYPE>
    </PRODUCT_DETAILS>
    <PRODUCT_FEATURES>
        <FEATURE>
            <FNAME>Colour</FNAME>
            <FVALUE>white</FVALUE>
        </FEATURE>
    </PRODUCT_FEATURES>
</HEADER>

I've written a few lines that look like this:

import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse('file.xml')
root = tree.getroot()

df = pd.DataFrame()

for i in range(0, len(root), 5):

    details = [(child.tag, child.text) for child in root[i + 0]]
    features = [(child[0].text, child[1].text) for child in root[i + 1]]

    temp_df = pd.DataFrame([[i[1] for i in details + features]], columns=[i[0] for i in details + features])

    df = pd.concat([df, temp_df])

df

# df.to_csv("file_export.csv", index=False)

... and produce this output:

    KEYWORD PRODUCT_TYPE    Colour
0   Paper   major           white

What edits do I need to make, to output this:

    SUPPLIER    KEYWORD PRODUCT_TYPE    REFERENCE   Colour
0   015         Paper   major           Class01     white

Thank you for your help!

Best, ~C


Solution

  • Here is one way to do it:

    import xml.etree.ElementTree as ET
    
    import pandas as pd
    
    tree = ET.parse("file.xml")
    root = tree.getroot()
    
    data = {
        "SUPPLIER": [],
        "KEYWORD": [],
        "PRODUCT_TYPE": [],
        "REFERENCE": [],
        "Colour": [],
    }
    
    for product in root:
        data["SUPPLIER"].append(product[0].text)
        data["KEYWORD"].append(product[1][0].text)
        data["PRODUCT_TYPE"].append(product[1][1].text)
        data["REFERENCE"].append(product[2][0].text)
        data["Colour"].append(product[2][1][1].text)
    
    df = pd.DataFrame(data)
    
    print(df)
    # Output
      SUPPLIER KEYWORD PRODUCT_TYPE REFERENCE Colour
    0      015   Paper        major   Class01  white