I have a single xml string as follows ( I show a sample here. It actually consists of 10K+ such elements, but all elements are included in the myData Tag)
myxml =
<myData>
<x1="A" name="P1" supp="TU01" type="CA" date="200607" proc="Y" iden="1" />
<x1="B" name="Q1" supp="TU01" type="CA" date="200609" proc="N" iden="5" />
<x1="B" name="R1" supp="UY7" type="CA" date="200609" proc="N" iden="12" />
</myData>
My goal is to parse this xml string and get it into a tabular pandas dataframe like structure with columns like x1, name, supp, type, date, proc & iden.
So my dataframe (output) should look like
df =
x1 name supp type date proc iden
A P1 TU01 CA 200607 Y 1
B Q1 TU01 CA 200609 N 5
B R1 UY7 CA 200609 N 12
As a first step I was trying to the following , but its not working as I am not getting anything apart from the x1 from 3 elements:
from xml.etree import ElementTree as ET
root = ET.fromstring(myxml)
print(root)
for child in root.iter('*'):
print(child.tag)
This one itself does not give me what I expect. Later I wanted to create a dict from which I was thinking to create the pandas dataframe:
xmlDict = {}
for parent in root:
child = parent.getchildren()
xmlDict[child[0].text] = child[1].text
Any help in these regards will be very much appreciated.
The first thing to notice is that your original XML is not well formed. Each tag should contain a tag name, whereas descendants of myData have only attributes.
Use the following XML with some tag name, e.g. row:
<myData>
<row x1="A" name="P1" supp="TU01" type="CA" date="200607" proc="Y" iden="1" />
<row x1="B" name="Q1" supp="TU01" type="CA" date="200609" proc="N" iden="5" />
<row x1="B" name="R1" supp="UY7" type="CA" date="200609" proc="N" iden="12" />
</myData>
Then read the source string as in your code:
root = ET.fromstring(myxml)
And to create the DataFrame, run:
df = pd.DataFrame([ child.attrib for child in root.iter('row') ])
The result is:
x1 name supp type date proc iden
0 A P1 TU01 CA 200607 Y 1
1 B Q1 TU01 CA 200609 N 5
2 B R1 UY7 CA 200609 N 12
(the leftmost column is the index).
Update:
If you have pandas in version 1.3.0 or later, you can do without ElementTree. Just call:
df = pd.read_xml(myxml)