I'm relatively new in Databricks. I have an XML spreadsheet data as seen below:
<?xml version="1.0" encoding="UTF-8"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<Styles>
<Style ss:ID="s22">
<NumberFormat ss:Format="General Date"/>
</Style>
<Style ss:ID="s63">
<Alignment ss:Vertical="Bottom" ss:Rotate="90"/>
</Style>
</Styles>
<Worksheet ss:Name="Test data">
<Table>
<Row ss:Index="1">
<Cell ss:Index="1" ss:StyleID="s63">
<Data ss:Type="String">ColHeader1</Data>
</Cell>
<Cell ss:Index="2" ss:StyleID="s63">
<Data ss:Type="String">ColHeader2</Data>
</Cell>
<Cell ss:Index="3" ss:StyleID="s63">
<Data ss:Type="String">ColHeader3</Data>
</Cell>
<Cell ss:Index="4" ss:StyleID="s63">
<Data ss:Type="String">ColHeader4</Data>
</Cell>
<Cell ss:Index="5" ss:StyleID="s63">
<Data ss:Type="String">ColHeader5</Data>
</Cell>
</Row>
<Row ss:Index="2">
<Cell ss:Index="1">
<Data ss:Type="String">Cell_1_1</Data>
</Cell>
<Cell ss:Index="2" ss:StyleID="s22">
<Data ss:Type="DateTime">2024-04-03T07:22:28.433</Data>
</Cell>
<Cell ss:Index="5">
<Data ss:Type="String">cell_1_5</Data>
</Cell>
</Row>
<Row ss:Index="3">
<Cell ss:Index="1">
<Data ss:Type="String">cell_2_1</Data>
</Cell>
<Cell ss:Index="2" ss:StyleID="s22">
<Data ss:Type="DateTime">2024-04-03T07:22:57.670</Data>
</Cell>
<Cell ss:Index="3">
<Data ss:Type="String">cell_2_3</Data>
</Cell>
<Cell ss:Index="4">
<Data ss:Type="String">cell_2_4</Data>
</Cell>
</Row>
</Table>
</Worksheet>
</Workbook>
I would like to load them into Databricks / Pandas dataframe so it looks like this:
|============|=========================|============|============|============|
| ColHeader1 | ColHeader2 | ColHeader3 | ColHeader4 | ColHeader5 |
|============|=========================|============|============|============|
| Cell_1_1 | 2024-04-03T07:22:28.433 | | | cell_1_5 |
| cell_2_1 | 2024-04-03T07:22:57.670 | cell_2_3 | cell_2_4 | |
|============|=========================|============|============|============|
The first "Row" will need to be adopted as the dataframe's column header name, while the subsequent Rows become the data and which column the data needs to be inserted to will be dependent on the index number in each row's Cell.
Would you be able to guide on how best to get this done please?
I tried but the result is like this:
|============|=========================|============|============|============|
| ColHeader1 | ColHeader2 | ColHeader3 | ColHeader4 | ColHeader5 |
|============|=========================|============|============|============|
| Cell_1_1 | 2024-04-03T07:22:28.433 | cell_1_5 | | |
| cell_2_1 | 2024-04-03T07:22:57.670 | cell_2_3 | cell_2_4 | |
|============|=========================|============|============|============|
Thanks in advance.
Without knowing more, I would recommend instead using openpyxl
or xlrd
to export each sheet in your excel workbook to a csv. You can find an example here
However, if you need to parse the raw xml, here is one way to achieve it.
import pandas
import xml.etree.ElementTree as ET
# Parse the XML string
root = ET.fromstring(xml_str)
# Define the XPath expression to select all cells
namespaces = {'ns' : 'urn:schemas-microsoft-com:office:spreadsheet'}
xpath_expr = "ns:Worksheet/ns:Table/"
# Extract the cell information using XPath
rows = root.findall(xpath_expr,namespaces)
data = []
for rowi, row in enumerate(rows):
for cell in row:
data.append({
'row' : rowi,
'cell': cell.attrib['{urn:schemas-microsoft-com:office:spreadsheet}Index'],
'data':cell.find('ns:Data',namespaces).text
})
# create a pandas DataFrame
temp = pandas.DataFrame(data).pivot('row','cell','data')
df = pandas.DataFrame(temp[1:].to_numpy(),columns=temp.loc[0])
df
Which gives us:
ColHeader1 | ColHeader2 | ColHeader3 | ColHeader4 | ColHeader5 | |
---|---|---|---|---|---|
0 | Cell_1_1 | 2024-04-03T07:22:28.433 | nan | nan | cell_1_5 |
1 | cell_2_1 | 2024-04-03T07:22:57.670 | cell_2_3 | cell_2_4 | nan |