Search code examples
pandasxmldataframexml-parsingtags

Parsing Single XML to pandas Dataframe


I have a single xml string as follows ( I show a sample here. It actually consists of 10K+ such elements, but all elements are included in the myData Tag)

myxml =

        <myData>
     <x1="A" name="P1" supp="TU01" type="CA"  date="200607" proc="Y" iden="1" />
     <x1="B" name="Q1" supp="TU01" type="CA" date="200609" proc="N"  iden="5" />
     <x1="B" name="R1" supp="UY7" type="CA"  date="200609" proc="N" iden="12"  />
      </myData>

My goal is to parse this xml string and get it into a tabular pandas dataframe like structure with columns like x1, name, supp, type, date, proc & iden.

So my dataframe (output) should look like

  df = 

       x1 name supp type date    proc  iden
    
       A   P1  TU01  CA  200607   Y     1
       B   Q1  TU01  CA  200609   N     5
       B   R1  UY7   CA  200609   N     12
         

As a first step I was trying to the following , but its not working as I am not getting anything apart from the x1 from 3 elements:

         from xml.etree import ElementTree as ET

         root = ET.fromstring(myxml)

         print(root)

         for child in root.iter('*'):

              print(child.tag)

This one itself does not give me what I expect. Later I wanted to create a dict from which I was thinking to create the pandas dataframe:

          xmlDict = {}
          for parent in root:

            child = parent.getchildren()

            xmlDict[child[0].text] = child[1].text

Any help in these regards will be very much appreciated.


Solution

  • The first thing to notice is that your original XML is not well formed. Each tag should contain a tag name, whereas descendants of myData have only attributes.

    Use the following XML with some tag name, e.g. row:

    <myData>
         <row x1="A" name="P1" supp="TU01" type="CA" date="200607" proc="Y" iden="1" />
         <row x1="B" name="Q1" supp="TU01" type="CA" date="200609" proc="N" iden="5" />
         <row x1="B" name="R1" supp="UY7"  type="CA" date="200609" proc="N" iden="12" />
    </myData>
    

    Then read the source string as in your code:

    root = ET.fromstring(myxml)
    

    And to create the DataFrame, run:

    df = pd.DataFrame([ child.attrib for child in root.iter('row') ])
    

    The result is:

      x1 name  supp type    date proc iden
    0  A   P1  TU01   CA  200607    Y    1
    1  B   Q1  TU01   CA  200609    N    5
    2  B   R1   UY7   CA  200609    N   12
    

    (the leftmost column is the index).

    Update:

    If you have pandas in version 1.3.0 or later, you can do without ElementTree. Just call:

    df = pd.read_xml(myxml)