Search code examples
pythonxmldataframe

Can we convert asymmetrical XML file to a symmetrical one using Python?


I have a really big XML file with the following structure:

<file>
  <fields>
    <lastname>LASTNAME1</lastname>
    <name>NAME1</name>
    <id>ID1</id>
    <wage>WAGE1</wage>
    <another-price>PRICE1</another-price>
    <method>METHOD1</method>
  </fields>
  <fields>
    <lastname>LASTNAME2</lastname>
    <name>NAME2</name>
    <id>ID2</id>
    <another-price>PRICE2</another-price>
    <method>METHOD2</method>
  </fields>
  <fields>
    <lastname>LASTNAME3</lastname>
    <id>ID3</id>
    <wage>WAGE3</wage>
    <another-price>PRICE3</another-price>
  </fields>
</file>

As you can see in this dummy data first field has all fields with a value, on the other hand second field is missing wage field and also the third one is missing both name and method fields. I'm using another Python code to convert all these in a table making the field names column names and the data as their row data. But when my XML is missing any fields my original loop can't keep track of column names and it just adds second field's another price data to wage column for example. I think I managed to explain my situation.

I tried like grouping by those field names within the code but I couldn't succeed it, so I thought of a solution like if there is a code that can convert my XML to symmetrical XML file with null values (or anything really) I'd stop encountering this issue. Waiting to hear your suggestions and solutions!


Solution

  • See below (the idea is to build a set of fields first)

    import xml.etree.ElementTree as ET
    
    xml = '''<file>
      <fields>
        <lastname>LASTNAME1</lastname>
        <name>NAME1</name>
        <id>ID1</id>
        <wage>WAGE1</wage>
        <another-price>PRICE1</another-price>
        <method>METHOD1</method>
      </fields>
      <fields>
        <lastname>LASTNAME2</lastname>
        <name>NAME2</name>
        <id>ID2</id>
        <another-price>PRICE2</another-price>
        <method>METHOD2</method>
      </fields>
      <fields>
        <lastname>LASTNAME3</lastname>
        <id>ID3</id>
        <wage>WAGE3</wage>
        <another-price>PRICE3</another-price>
      </fields>
    </file>'''
    
    root = ET.fromstring(xml)
    fields_set = set()
    fields_list = root.findall('.//fields')
    for fields_entry in fields_list:
        for field in fields_entry:
            fields_set.add(field.tag)
    print(f'fields set: {fields_set}')
    for fields_entry in fields_list:
        data = {}
        for field in fields_set:
            element = fields_entry.find(field)
            value = element.text if element is not None else 'N/A'
            data[field] = value
        print(data)
    

    output

    fields set: {'another-price', 'name', 'id', 'lastname', 'wage', 'method'}
    {'another-price': 'PRICE1', 'name': 'NAME1', 'id': 'ID1', 'lastname': 'LASTNAME1', 'wage': 'WAGE1', 'method': 'METHOD1'}
    {'another-price': 'PRICE2', 'name': 'NAME2', 'id': 'ID2', 'lastname': 'LASTNAME2', 'wage': 'N/A', 'method': 'METHOD2'}
    {'another-price': 'PRICE3', 'name': 'N/A', 'id': 'ID3', 'lastname': 'LASTNAME3', 'wage': 'WAGE3', 'method': 'N/A'}