How to load XML spreadsheet with jumping column index numbers to Databricks/Pandas dataframe

I'm relatively new in Databricks. I have an XML spreadsheet data as seen below:

<?xml version="1.0" encoding="UTF-8"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
    <Styles>
        <Style ss:ID="s22">
            <NumberFormat ss:Format="General Date"/>
        </Style>
        <Style ss:ID="s63">
            <Alignment ss:Vertical="Bottom" ss:Rotate="90"/>
        </Style>
    </Styles>
    <Worksheet ss:Name="Test data">
        <Table>
            <Row ss:Index="1">
                <Cell ss:Index="1" ss:StyleID="s63">
                    <Data ss:Type="String">ColHeader1</Data>
                </Cell>
                <Cell ss:Index="2" ss:StyleID="s63">
                    <Data ss:Type="String">ColHeader2</Data>
                </Cell>
                <Cell ss:Index="3" ss:StyleID="s63">
                    <Data ss:Type="String">ColHeader3</Data>
                </Cell>
                <Cell ss:Index="4" ss:StyleID="s63">
                    <Data ss:Type="String">ColHeader4</Data>
                </Cell>
                <Cell ss:Index="5" ss:StyleID="s63">
                    <Data ss:Type="String">ColHeader5</Data>
                </Cell>
            </Row>
            <Row ss:Index="2">
                <Cell ss:Index="1">
                    <Data ss:Type="String">Cell_1_1</Data>
                </Cell>
                <Cell ss:Index="2" ss:StyleID="s22">
                    <Data ss:Type="DateTime">2024-04-03T07:22:28.433</Data>
                </Cell>
                <Cell ss:Index="5">
                    <Data ss:Type="String">cell_1_5</Data>
                </Cell>
            </Row>
            <Row ss:Index="3">
                <Cell ss:Index="1">
                    <Data ss:Type="String">cell_2_1</Data>
                </Cell>
                <Cell ss:Index="2" ss:StyleID="s22">
                    <Data ss:Type="DateTime">2024-04-03T07:22:57.670</Data>
                </Cell>
                <Cell ss:Index="3">
                    <Data ss:Type="String">cell_2_3</Data>
                </Cell>
                <Cell ss:Index="4">
                    <Data ss:Type="String">cell_2_4</Data>
                </Cell>
            </Row>
        </Table>
    </Worksheet>
</Workbook>

I would like to load them into Databricks / Pandas dataframe so it looks like this:

|============|=========================|============|============|============|
| ColHeader1 | ColHeader2              | ColHeader3 | ColHeader4 | ColHeader5 |
|============|=========================|============|============|============|
| Cell_1_1   | 2024-04-03T07:22:28.433 |            |            | cell_1_5   |
| cell_2_1   | 2024-04-03T07:22:57.670 | cell_2_3   | cell_2_4   |            |
|============|=========================|============|============|============|

The first "Row" will need to be adopted as the dataframe's column header name, while the subsequent Rows become the data and which column the data needs to be inserted to will be dependent on the index number in each row's Cell.

Would you be able to guide on how best to get this done please?

I tried but the result is like this:

|============|=========================|============|============|============|
| ColHeader1 | ColHeader2              | ColHeader3 | ColHeader4 | ColHeader5 |
|============|=========================|============|============|============|
| Cell_1_1   | 2024-04-03T07:22:28.433 | cell_1_5   |            |            |
| cell_2_1   | 2024-04-03T07:22:57.670 | cell_2_3   | cell_2_4   |            |
|============|=========================|============|============|============|

Thanks in advance.

Solution

Without knowing more, I would recommend instead using openpyxl or xlrd to export each sheet in your excel workbook to a csv. You can find an example here

However, if you need to parse the raw xml, here is one way to achieve it.

import pandas
import xml.etree.ElementTree as ET

# Parse the XML string
root = ET.fromstring(xml_str)

# Define the XPath expression to select all cells
namespaces = {'ns' : 'urn:schemas-microsoft-com:office:spreadsheet'}
xpath_expr = "ns:Worksheet/ns:Table/"

# Extract the cell information using XPath
rows = root.findall(xpath_expr,namespaces)

data = []
for rowi, row in enumerate(rows):
  for cell in row:
    data.append({
      'row' : rowi,
      'cell': cell.attrib['{urn:schemas-microsoft-com:office:spreadsheet}Index'],
      'data':cell.find('ns:Data',namespaces).text
    })

# create a pandas DataFrame
temp = pandas.DataFrame(data).pivot('row','cell','data')
df = pandas.DataFrame(temp[1:].to_numpy(),columns=temp.loc[0])
df

Which gives us:

	ColHeader1	ColHeader2	ColHeader3	ColHeader4	ColHeader5
0	Cell_1_1	2024-04-03T07:22:28.433	nan	nan	cell_1_5
1	cell_2_1	2024-04-03T07:22:57.670	cell_2_3	cell_2_4	nan