Search code examples
pythonpandasxmljupyterexport-to-csv

How do I correctly display xml structure with pandas?


I'm looking for some insights on how to properly display this xml:

<?xml version="1.0" encoding="UTF-8"?>
<HEADER>
    <PRODUCT>
        <SUPPLIER>015</SUPPLIER>
        <PRODUCT_DETAILS>
            <KEYWORD>Paper</KEYWORD>
            <PRODUCT_TYPE>major</PRODUCT_TYPE>
        </PRODUCT_DETAILS>
        <PRODUCT_FEATURES>
            <REFERENCE>Class01</REFERENCE>
            <FEATURE>
                <FNAME>Colour</FNAME>
                <FVALUE>white</FVALUE>
            </FEATURE>
        </PRODUCT_FEATURES>
    </PRODUCT>
</HEADER>

For a simpler structure, that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<HEADER>
    <PRODUCT_DETAILS>
        <KEYWORD>Paper</KEYWORD>
        <PRODUCT_TYPE>major</PRODUCT_TYPE>
    </PRODUCT_DETAILS>
    <PRODUCT_FEATURES>
        <FEATURE>
            <FNAME>Colour</FNAME>
            <FVALUE>white</FVALUE>
        </FEATURE>
    </PRODUCT_FEATURES>
</HEADER>

I've written a few lines that look like this:

import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse('file.xml')
root = tree.getroot()

df = pd.DataFrame()

for i in range(0, len(root), 2):

    details = [(child.tag, child.text) for child in root[i + 0]]
    features = [(child[0].text, child[1].text) for child in root[i + 1]]

    temp_df = pd.DataFrame([[i[1] for i in details + features]], columns=[i[0] for i in details + features])

    df = pd.concat([df, temp_df])

df

# df.to_csv("file_export.csv", index=False)

... and produce this output:

    KEYWORD PRODUCT_TYPE    Colour
0   Paper   major           white

What edits do I need to make, to output this:

    SUPPLIER    KEYWORD PRODUCT_TYPE    REFERENCE   Colour
0   015         Paper   major           Class01     white

Thank you for your help!

Best, ~C


Solution

  • The below will do the job

    import xml.etree.ElementTree as ET
    import pandas as pd
    
    xml = '''<?xml version="1.0" encoding="UTF-8"?>
    <HEADER>
        <PRODUCT>
            <SUPPLIER>015</SUPPLIER>
            <PRODUCT_DETAILS>
                <KEYWORD>Paper</KEYWORD>
                <PRODUCT_TYPE>major</PRODUCT_TYPE>
            </PRODUCT_DETAILS>
            <PRODUCT_FEATURES>
                <REFERENCE>Class01</REFERENCE>
                <FEATURE>
                    <FNAME>Colour</FNAME>
                    <FVALUE>white</FVALUE>
                </FEATURE>
            </PRODUCT_FEATURES>
        </PRODUCT>
    </HEADER>'''
    
    elements = ['SUPPLIER','KEYWORD','PRODUCT_TYPE','REFERENCE','FNAME','FVALUE']
    root = ET.fromstring(xml)
    data = {e:root.find(f'.//{e}').text for e in elements}
    data[data['FNAME']] = data['FVALUE']
    del data['FVALUE']
    del data['FNAME']
    
    df = pd.DataFrame([data])
    print(df)
    

    output

      SUPPLIER KEYWORD PRODUCT_TYPE REFERENCE Colour
    0      015   Paper        major   Class01  white