Search code examples
pythonpandasxml

read xml file, convert it to table (dataframe)


this the first time I am dealing with xml file, so I am very lost. I would appreciate any help. All I want is to read the file and convert it to regular table (dataframe).

I have file with this structure:

<?xml version="1.0"  encoding='UTF-8'?>
<LucroCliente xmlns:my='http://www.ms.com/pace' xmlns='http://www.ms.com/pace' Cab_Usuario='UsuI' Cab_DadosEmpresa='' Cab_RazaoEmpresa='CoLtda.' Cab_Aplicativo='Comercial' Cab_Data='25/07/2022 14:40:38' Cab_Titulo='Relatório Cab_Titulo' Selecao='Selecao' Periodo='Período: 01/01/2020 - 31/12/2020'>
<Filial Filial=''>
<Linha TotalLinha='TOA 21:  2.313.292,43'>
<Produto Coluna1='21-851611' Coluna2='CAMIO VO' Coluna3='' Coluna4='' Coluna5=''>
<AnaliseDiaria Coluna6='' Coluna7='' Coluna8='' Coluna9='' Coluna10='' Coluna11='' Coluna12='' Coluna13='' Coluna14='' Coluna15='' Coluna16=''/>
</Produto>
<Produto Coluna1='21-3667984' Coluna2='SCA4X2' Coluna3='-1' Coluna4='' Coluna5=''>
<AnaliseDiaria Coluna6='' Coluna7='' Coluna8='' Coluna9='' Coluna10='' Coluna11='' Coluna12='' Coluna13='' Coluna14='' Coluna15='' Coluna16=''/>
</Produto>
<Produto Coluna1='21-3667994' Coluna2='SCA963' Coluna3='-1' Coluna4='' Coluna5=''>
<AnaliseDiaria Coluna6='' Coluna7='' Coluna8='' Coluna9='' Coluna10='' Coluna11='' Coluna12='' Coluna13='' Coluna14='' Coluna15='' Coluna16=''/>
</Produto>
<Produto Coluna1='21-3676543' Coluna2='SCA713' Coluna3='-1' Coluna4='' Coluna5=''>
<AnaliseDiaria Coluna6='' Coluna7='' Coluna8='' Coluna9='' Coluna10='' Coluna11='' Coluna12='' Coluna13='' Coluna14='' Coluna15='' Coluna16=''/>
</Produto>
<Produto Coluna1='21-3676601' Coluna2='SCA97' Coluna3='-1' Coluna4='' Coluna5=''>
<AnaliseDiaria Coluna6='' Coluna7='' Coluna8='' Coluna9='' Coluna10='' Coluna11='' Coluna12='' Coluna13='' Coluna14='' Coluna15='' Coluna16=''/>
</Produto>
<Produto Coluna1='21-3814014' Coluna2='CAMIX2' Coluna3='' Coluna4='' Coluna5=''>
<AnaliseDiaria Coluna6='' Coluna7='' Coluna8='' Coluna9='' Coluna10='' Coluna11='' Coluna12='' Coluna13='' Coluna14='' Coluna15='' Coluna16=''/>
</Produto>
<Produto Coluna1='21-3814087' Coluna2='SCA56' Coluna3='' Coluna4='' Coluna5=''>
<AnaliseDiaria Coluna6='' Coluna7='' Coluna8='' Coluna9='' Coluna10='' Coluna11='' Coluna12='' Coluna13='' Coluna14='' Coluna15='' Coluna16=''/>
<AnaliseDiaria Coluna6='19/06/20' Coluna7='01' Coluna8='EP 202022777' Coluna9='1 UN' Coluna10='195.000,00' Coluna11='195.000,00' Coluna12='1' Coluna13='195.000,00' Coluna14='195.000,00' Coluna15='NF9' Coluna16='10203910A'/>
<AnaliseDiaria Coluna6='13/07/20' Coluna7='01' Coluna8='RCP G 41765' Coluna9='0 UN' Coluna10='' Coluna11='90,00' Coluna12='1' Coluna13='195.090,00' Coluna14='195.090,00' Coluna15='' Coluna16=''/>
<AnaliseDiaria Coluna6='27/07/20' Coluna7='01' Coluna8='RCP G 41767' Coluna9='0 UN' Coluna10='' Coluna11='180,00' Coluna12='1' Coluna13='195.270,00' Coluna14='195.270,00' Coluna15='' Coluna16=''/>
<AnaliseDiaria Coluna6='27/07/20' Coluna7='01' Coluna8='RCP G 41768' Coluna9='0 UN' Coluna10='' Coluna11='212,60' Coluna12='1' Coluna13='195.482,60' Coluna14='195.482,60' Coluna15='' Coluna16=''/>
<AnaliseDiaria Coluna6='27/07/20' Coluna7='01' Coluna8='RCP G 41770' Coluna9='0 UN' Coluna10='' Coluna11='145,20' Coluna12='1' Coluna13='195.627,80' Coluna14='195.627,80' Coluna15='' Coluna16=''/>
<AnaliseDiaria Coluna6='27/07/20' Coluna7='01' Coluna8='RCP G 41771' Coluna9='0 UN' Coluna10='' Coluna11='8.902,02' Coluna12='1' Coluna13='204.529,82' Coluna14='204.529,82' Coluna15='' Coluna16=''/>
<AnaliseDiaria Coluna6='27/07/20' Coluna7='01' Coluna8='VP 323755' Coluna9='-1 UN' Coluna10='204.529,82' Coluna11='-204.529,82' Coluna12='0' Coluna13='' Coluna14='' Coluna15='' Coluna16='158PES'/>
</Produto>
</Linha>
</Filial>
</LucroCliente>

I tried multiple solutions I found here but nothing worked out, for example: first solution:

xml_data = open('file.xml', 'r').read()
root = et.XML(xml_data)  # Parse XML

data = []
cols = []
for i, child in enumerate(root):
    data.append([subchild.text for subchild in child])
    cols.append(child.tag)
df = pd.DataFrame(data).T
df.columns = cols

second solution:

xml_data = objectify.parse('file.xml') 
root = xml_data.getroot()  

data = []
cols = []
for i in range(len(root.getchildren())):
    child = root.getchildren()[i]
    data.append([subchild.text for subchild in child.getchildren()])
    cols.append(child.tag)
df = pd.DataFrame(data).T
df.columns = cols

My end table will look like below:

| Coluna1   | Coluna2  | Coluna3 | Coluna4 | coluna2 | couna6 | coluna7 | coluna8 | coluna9 | colun10 | coluna11 | coluna12 | coluna13 | coluna14 | coluna15 | coluna16 |
| --------- | -------- | ------- | ------- | ------- | ------ | ------- | ------- | ------- | ------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 21-851611 | CAMIO VO |         |         |         |        |         |         |         |         |          |          |          |          |          |          |
|           |          |         |         |         |        |         |         |         |         |          |          |          |          |          |          |
|           |          |         |         |         |        |         |         |         |         |          |          |          |          |          |          |
|           |          |         |         |         |        |         |         |         |         |          |          |          |          |          |          |

Solution

  • Fortunately, in the case of your xml in the question, you can use the pandas read_xml() method, although you'll have to skirt around the namespaces issue:

    import pandas as pd
    pd.read_xml(file.xml,xpath='//*[local-name()="Linha"]//*[local-name()="Produto"]')
    

    Output:

        Coluna1        Coluna2    Coluna3     Coluna4   Coluna5     {http://www.ms.com/pace}AnaliseDiaria
    0   21-851611   CAMIO VO    NaN     NaN     NaN     NaN
    1   21-3667984  SCA4X2  -1.0    NaN     NaN     NaN
    2   21-3667994  SCA963  -1.0    NaN     NaN     NaN
    

    etc. If you are not interested in one column or anothter, you can simply drop() it.