Search code examples
pythonxmldataframepysparkdatabricks

How to load XML spreadsheet with jumping column index numbers to Databricks/Pandas dataframe


I'm relatively new in Databricks. I have an XML spreadsheet data as seen below:

<?xml version="1.0" encoding="UTF-8"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
    <Styles>
        <Style ss:ID="s22">
            <NumberFormat ss:Format="General Date"/>
        </Style>
        <Style ss:ID="s63">
            <Alignment ss:Vertical="Bottom" ss:Rotate="90"/>
        </Style>
    </Styles>
    <Worksheet ss:Name="Test data">
        <Table>
            <Row ss:Index="1">
                <Cell ss:Index="1" ss:StyleID="s63">
                    <Data ss:Type="String">ColHeader1</Data>
                </Cell>
                <Cell ss:Index="2" ss:StyleID="s63">
                    <Data ss:Type="String">ColHeader2</Data>
                </Cell>
                <Cell ss:Index="3" ss:StyleID="s63">
                    <Data ss:Type="String">ColHeader3</Data>
                </Cell>
                <Cell ss:Index="4" ss:StyleID="s63">
                    <Data ss:Type="String">ColHeader4</Data>
                </Cell>
                <Cell ss:Index="5" ss:StyleID="s63">
                    <Data ss:Type="String">ColHeader5</Data>
                </Cell>
            </Row>
            <Row ss:Index="2">
                <Cell ss:Index="1">
                    <Data ss:Type="String">Cell_1_1</Data>
                </Cell>
                <Cell ss:Index="2" ss:StyleID="s22">
                    <Data ss:Type="DateTime">2024-04-03T07:22:28.433</Data>
                </Cell>
                <Cell ss:Index="5">
                    <Data ss:Type="String">cell_1_5</Data>
                </Cell>
            </Row>
            <Row ss:Index="3">
                <Cell ss:Index="1">
                    <Data ss:Type="String">cell_2_1</Data>
                </Cell>
                <Cell ss:Index="2" ss:StyleID="s22">
                    <Data ss:Type="DateTime">2024-04-03T07:22:57.670</Data>
                </Cell>
                <Cell ss:Index="3">
                    <Data ss:Type="String">cell_2_3</Data>
                </Cell>
                <Cell ss:Index="4">
                    <Data ss:Type="String">cell_2_4</Data>
                </Cell>
            </Row>
        </Table>
    </Worksheet>
</Workbook>

I would like to load them into Databricks / Pandas dataframe so it looks like this:

|============|=========================|============|============|============|
| ColHeader1 | ColHeader2              | ColHeader3 | ColHeader4 | ColHeader5 |
|============|=========================|============|============|============|
| Cell_1_1   | 2024-04-03T07:22:28.433 |            |            | cell_1_5   |
| cell_2_1   | 2024-04-03T07:22:57.670 | cell_2_3   | cell_2_4   |            |
|============|=========================|============|============|============|

The first "Row" will need to be adopted as the dataframe's column header name, while the subsequent Rows become the data and which column the data needs to be inserted to will be dependent on the index number in each row's Cell.

Would you be able to guide on how best to get this done please?

I tried but the result is like this:

|============|=========================|============|============|============|
| ColHeader1 | ColHeader2              | ColHeader3 | ColHeader4 | ColHeader5 |
|============|=========================|============|============|============|
| Cell_1_1   | 2024-04-03T07:22:28.433 | cell_1_5   |            |            |
| cell_2_1   | 2024-04-03T07:22:57.670 | cell_2_3   | cell_2_4   |            |
|============|=========================|============|============|============|

Thanks in advance.


Solution

  • Without knowing more, I would recommend instead using openpyxl or xlrd to export each sheet in your excel workbook to a csv. You can find an example here

    However, if you need to parse the raw xml, here is one way to achieve it.

    import pandas
    import xml.etree.ElementTree as ET
    
    # Parse the XML string
    root = ET.fromstring(xml_str)
    
    # Define the XPath expression to select all cells
    namespaces = {'ns' : 'urn:schemas-microsoft-com:office:spreadsheet'}
    xpath_expr = "ns:Worksheet/ns:Table/"
    
    # Extract the cell information using XPath
    rows = root.findall(xpath_expr,namespaces)
    
    data = []
    for rowi, row in enumerate(rows):
      for cell in row:
        data.append({
          'row' : rowi,
          'cell': cell.attrib['{urn:schemas-microsoft-com:office:spreadsheet}Index'],
          'data':cell.find('ns:Data',namespaces).text
        })
    
    # create a pandas DataFrame
    temp = pandas.DataFrame(data).pivot('row','cell','data')
    df = pandas.DataFrame(temp[1:].to_numpy(),columns=temp.loc[0])
    df
    

    Which gives us:

    ColHeader1 ColHeader2 ColHeader3 ColHeader4 ColHeader5
    0 Cell_1_1 2024-04-03T07:22:28.433 nan nan cell_1_5
    1 cell_2_1 2024-04-03T07:22:57.670 cell_2_3 cell_2_4 nan