I am having trouble in importing an XML file to Pandas where I need to grab data from two parent nodes. One parent node (AgentID
) has data directly in it, and the other (Sales
) has child nodes (Location
, Size
, Status
) that contain data, as given below.
test_xml = '''<TEST_XML>
<Sales>
<AgentID>0001</AgentID>
<Sale>
<Location>0</Location>
<Size>1000</Size>
<Status>Available</Status>
</Sale>
<Sale>
<Location>1</Location>
<Size>500</Size>
<Status>Unavailable</Status>
</Sale>
</Sales>
</TEST_XML>'''
when I try to import this to a Pandas Dataframe below is the only way I was able to grab data under the Sale
tag.
import pandas as pd
df = pd.read_xml(test_xml, xpath='//Sale')
which gives me a dataframe like the one below:
Location Size Status
0 0 1000 Available
1 1 500 Unavailable
What I need is including the AgentID
tag in the DataFrame too, to get the following, but I was unsuccessful. Expected output is given below for clarity:
AgentID Location Size Status
0 0001 0 1000 Available
1 0001 1 500 Unavailable
Is there a way to manipulate the xpath
parameter to include the data inside the AgentID
tag as well, or is it impossible to do it using Pandas' read_xml
function? I tried passing a list like xpath=['//AgentID', '//Sale']
but of course, it did not work...
I don't think you can get the desired output using just read_xml()
; however, it's possible by manipulating it a bit. Essentially, the idea is to get everything from the xml using a generic xpath, select the required columns, populate the AgentID
column to corresponding to Sale
columns; then remove redundant rows.
df = pd.read_xml(io.StringIO(test_xml), xpath='//*', dtype=str)[['AgentID', 'Location', 'Size', 'Status']]
df['AgentID'] = df['AgentID'].ffill()
df = df.dropna(how='any').astype({'Location': int, 'Size': int}).reset_index(drop=True)
An "easier" solution to get a parent node (although not related to the exact question in the OP) is to convert the XML into a Python dictionary and normalize it into dataframe using pd.json_normalize
. This works because meta fields (in this case AgentID
) can be specified here. However, we need to install a third-party library (xmltodict
) to achieve the first step.
!pip install xmltodict
import xmltodict
df = (
pd.json_normalize(xmltodict.parse(test_xml)['TEST_XML']['Sales'],
record_path=['Sale'], meta=['AgentID'])
[['AgentID', 'Location', 'Size', 'Status']]
)