I'm trying to convert a xml file into a pandas dataframe with the read_xml function. The problem is that one of the nested elements is repeated several times (but with different attributes), and it seems that pandas creates only one column for this element that gets written over every time until it gets to the last one.
Here is the contents of the xml file I used:
<?xml version="1.0" encoding="UTF-8"?>
<histdata>
<item>
<datetime>20-1-2023 00:00:00</datetime>
<value channel="Traffic Total" chid="-1">92 kbit/s</value>
<value channel="Traffic In" chid="0">77 kbit/s</value>
<value channel="Traffic Out" chid="1">16 kbit/s</value>
<value channel="Downtime" chid="-4">0 %</value>
<coverage>100 %</coverage>
</item>
<item>
<datetime>20-1-2023 00:05:00</datetime>
<value channel="Traffic Total" chid="-1">82 kbit/s</value>
<value channel="Traffic In" chid="0">727 kbit/s</value>
<value channel="Traffic Out" chid="1">18 kbit/s</value>
<value channel="Downtime" chid="-4">0 %</value>
<coverage>100 %</coverage>
</item>
</histdata>
simply reading the xml file with read_xml with:
import pandas as pd
df = pd.read_xml('chdata.xml')
df
gives: enter image description here
As you can see, there is only one column for 'value' where I would expect one column for every value element in the item tag of the xml file (maybe with the attribute in the column name to distinct them from the others).
I know I can retrieve the values of these elements with the following code:
chdata = pd.read_xml('chdata.xml', xpath='//value')
chdata
This gives me a dataframe containing all value elements, as shown here: enter image description here
What I actually need is one dataframe that shows the information from both previous df's in one table. The one 'value' column from 'df' should be replaced by four columns containing the four 'value' lines from each item in the xml. One of the attributes (channel name or id) could be used as a column name.
Preferably the table should look something like this:
datetime | chid (or name) from 1st value | same from 2nd | 3rd | 4th | coverage |
---|
Can anyone help me to achieve this? If it's possible with just the tools provided by python and pandas that would be nice, but any solution is welcome.
Consider using the names
argument per docs:
Column names for DataFrame of parsed XML data. Use this parameter to rename original element names and distinguish same named elements and attributes.
items_df = pd.read_xml(
"Input.xml",
xpath=".//item",
names = [
"datetime", "value1", "value2", "value3", "value4", "coverage"
]
)
print(items_df)
# datetime value1 value2 value3 value4 coverage
# 0 20-1-2023 00:00:00 92 kbit/s 77 kbit/s 16 kbit/s 0 % 100 %
# 1 20-1-2023 00:05:00 82 kbit/s 727 kbit/s 18 kbit/s 0 % 100 %
To retrieve the further descendants, consider iterparse
, normally reserved for very large XML files but offers an alternative to parsing XML outside of XPath.
attribs_df = pd.read_xml(
"Input.xml",
iterparse = {
"item": [
"channel", "chid", "channel", "chid",
"channel", "chid", "channel", "chid"
]
},
names = [
"channel1", "chid1", "channel2", "chid2",
"channel3", "chid3", "channel4", "chid4"
]
)
Then join the two sets:
items_df = items_df.join(attribs_df)
print(items_df)
# datetime value1 value2 value3 value4 coverage channel1 chid1 channel2 chid2 channel3 chid3 channel4 chid4
# 0 20-1-2023 00:00:00 92 kbit/s 77 kbit/s 16 kbit/s 0 % 100 % Traffic Total -1 Traffic In 0 Traffic Out 1 Downtime -4
# 1 20-1-2023 00:05:00 82 kbit/s 727 kbit/s 18 kbit/s 0 % 100 % Traffic Total -1 Traffic In 0 Traffic Out 1 Downtime -4
Note: All should be achievable with a single iterparse
call by including datetime
, value
, and coverage
but doing so repeats and affects order which may be a bug.