Search code examples
pythonpandasxmldataframe

Importing data from two XML parent nodes to a Pandas DataFrame using read_xml


I am having trouble in importing an XML file to Pandas where I need to grab data from two parent nodes. One parent node (AgentID) has data directly in it, and the other (Sales) has child nodes (Location, Size, Status) that contain data, as given below.

test_xml = '''<TEST_XML>
        <Sales>
            <AgentID>0001</AgentID>
            <Sale>
                <Location>0</Location>
                <Size>1000</Size>
                <Status>Available</Status>
            </Sale>
            <Sale>
                <Location>1</Location>
                <Size>500</Size>
                <Status>Unavailable</Status>
            </Sale>
        </Sales>
    </TEST_XML>'''

when I try to import this to a Pandas Dataframe below is the only way I was able to grab data under the Sale tag.

import pandas as pd
df = pd.read_xml(test_xml, xpath='//Sale')

which gives me a dataframe like the one below:

    Location    Size    Status
0   0           1000    Available
1   1           500     Unavailable

What I need is including the AgentID tag in the DataFrame too, to get the following, but I was unsuccessful. Expected output is given below for clarity:

    AgentID     Location    Size    Status
0   0001        0           1000    Available
1   0001        1           500     Unavailable

Is there a way to manipulate the xpath parameter to include the data inside the AgentID tag as well, or is it impossible to do it using Pandas' read_xml function? I tried passing a list like xpath=['//AgentID', '//Sale'] but of course, it did not work...


Solution

  • I don't think you can get the desired output using just read_xml(); however, it's possible by manipulating it a bit. Essentially, the idea is to get everything from the xml using a generic xpath, select the required columns, populate the AgentID column to corresponding to Sale columns; then remove redundant rows.

    df = pd.read_xml(io.StringIO(test_xml), xpath='//*', dtype=str)[['AgentID', 'Location', 'Size', 'Status']]
    df['AgentID'] = df['AgentID'].ffill()
    df = df.dropna(how='any').astype({'Location': int, 'Size': int}).reset_index(drop=True)
    

    output


    An "easier" solution to get a parent node (although not related to the exact question in the OP) is to convert the XML into a Python dictionary and normalize it into dataframe using pd.json_normalize. This works because meta fields (in this case AgentID) can be specified here. However, we need to install a third-party library (xmltodict) to achieve the first step.

    !pip install xmltodict
    import xmltodict
    df = (
        pd.json_normalize(xmltodict.parse(test_xml)['TEST_XML']['Sales'], 
                          record_path=['Sale'], meta=['AgentID'])
        [['AgentID', 'Location', 'Size', 'Status']]
    )