Search code examples
pythonpandasxmljupyterexport-to-csv

Nested xml attributes & text don't show in df using pandas


I am new to Python and have a file.xml with the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<HEADER>
    <PRODUCT_DETAILS>
        <DESCRIPTION_SHORT>blue dog w short hair</DESCRIPTION_SHORT>
        <DESCRIPTION_LONG>blue dog w short hair and unlimitied zoomies</DESCRIPTION_LONG>
    </PRODUCT_DETAILS>
    <PRODUCT_FEATURES>
        <FEATURE>
            <FNAME>Hair</FNAME>
            <FVALUE>short</FVALUE>
        </FEATURE>
        <FEATURE>
            <FNAME>Colour</FNAME>
            <FVALUE>blue</FVALUE>
        </FEATURE>
        <FEATURE>
            <FNAME>Legs</FNAME>
            <FVALUE>4</FVALUE>
        </FEATURE>
    </PRODUCT_FEATURES>
</HEADER>

I am using a very simple snippet (below) to turn it into file_export.csv:

import pandas as pd

df = pd.read_xml("file.xml")

# df

df.to_csv("file_export.csv", index=False)

The problem is that I end up with a table like this:

DESCRIPTION_SHORT       DESCRIPTION_LONG                                FEATURE
blue dog w short hair   blue dog w short hair and unlimitied zoomies    NaN

I tried removing the FEATURE attribute but ended up overwriting(?) previous FNAME and FVALUE with the last one, assuming because they are called the same:

DESCRIPTION_SHORT       DESCRIPTION_LONG                                FNAME   FVALUE
blue dog w short hair   blue dog w short hair and unlimitied zoomies    None    NaN
None                    None                                            Legs    4.0

What do I need to add to my code to show the nested attributes including their text? Like this:

DESCRIPTION_SHORT       DESCRIPTION_LONG                                FEATURE FNAME   FVALUE
blue dog w short hair   blue dog w short hair and unlimitied zoomies    NaN     Hair    short
blue dog w short hair   blue dog w short hair and unlimitied zoomies    NaN     Colour  blue
blue dog w short hair   blue dog w short hair and unlimitied zoomies    NaN     Legs    4

Thank you in advance!!

~ C


Solution

  • First, the sample xml in your question (and probably your actual xml) doesn't really lend itself to read_xml(). In this case you are probably better off using an actual xml parser and handing the output over to pandas.

    In addition, I don't think your desired output is very efficient - in your example, you repeat each of the long and short description 3 times, for no apparent reason.

    Having said all that, I would suggest something like this:

    Assuming your actual xml has more than one pet, something like:

    inventory="""<?xml version="1.0" encoding="UTF-8"?>
    <doc>
    <HEADER>
        <PRODUCT_DETAILS>
            <DESCRIPTION_SHORT>green cat w short hair</DESCRIPTION_SHORT>
            <DESCRIPTION_LONG>green cat w short hair and unlimitied zoomies</DESCRIPTION_LONG>
        </PRODUCT_DETAILS>
        <PRODUCT_FEATURES>
            <FEATURE>
                <FNAME>Hair</FNAME>
                <FVALUE>medium</FVALUE>
            </FEATURE>
            <FEATURE>
                <FNAME>Colour</FNAME>
                <FVALUE>green</FVALUE>
            </FEATURE>
            <FEATURE>
                <FNAME>Legs</FNAME>
                <FVALUE>14</FVALUE>
            </FEATURE>
        </PRODUCT_FEATURES>
    </HEADER>
    ****the HEADER in your question goes here***
    </doc>"""
    
    from lxml import etree
    import pandas as pd
    
    doc = etree.XML(inventory.encode())
    pets = doc.xpath('//HEADER')
    
    headers=[elem.tag for elem in doc.xpath('//HEADER[1]//PRODUCT_DETAILS//*')]
    headers.extend(doc.xpath('//HEADER[1]//FNAME/text()'))
    
    rows = []
    
    for pet in pets:
    
        row = [pet.xpath(f'.//{headers[0]}/text()')[0],pet.xpath(f'.//{headers[1]}/text()')[0]]
        f_values = pet.xpath('.//FVALUE/text()')
        row.extend(f_values)    
        rows.append(row)
    

    If you want to be even more adventurous and use xpath 2.0 (which lxml doesn't support) as well as more list comprehensions, you can try this:

    from elementpath import select
    
    expression1 = '//HEADER[1]/string-join((./PRODUCT_DETAILS//*/name(),./PRODUCT_FEATURES//FNAME),",")'
    expression2 = '//HEADER/string-join((./PRODUCT_DETAILS//*,./PRODUCT_FEATURES//FVALUE),",")'
    headers = [h.split(',') for h in select(doc, expression1 )]
    rows= [r.split(',') for r in select(doc, expression2)]
    

    In either case:

    pd.DataFrame(rows,columns=headers)
    

    should output:

           DESCRIPTION_SHORT    DESCRIPTION_LONG                                 Hair   Colour  Legs
    0   green cat w short hair  green cat w short hair and unlimitied zoomies   medium  green   14
    1   blue dog w long hair    blue dog w long hair and limitied zoomies   short   blue    4