Search code examples
pythonxmldataframe

Read a long XML file through python


I've never worked with XML files. I have this XML file which is a really long file containing various information about different type of financial instrument.

<RiskFullInfo><dossier>1234</dossier><stockAmounts><stockAmount><dossier>1234</dossier><sottorubrica>0</sottorubrica><idInstrument>173</idInstrument><shortname>ABC.MI</shortname><name>abc</name><data>dd/mm/yyyy 00:00:00</data><quantita>number</quantita><pmc>somenumber</pmc><pmcEuro>eurovalyes</pmcEuro><marketprice>europrice</marketprice>

this is the structure repeated for n instruments.

What I would like to do is to put in a dataframe the text/values of fields idInstrument, shortname,quantita for ALL the n instruments in this file.

Once used the code to get the root with ET, I tried to extract the information I need with node.attrib and node.find but I get an empty dataframe.

import xml.etree.ElementTree as ET
tree = ET.parse(r'path')
root = tree.getroot()
df_cols = ["name", "Position"]
rows = []
for node in root:
    s_name = node.attrib.get("shortname")
    s_q = node.find("quantita")

Can someone help on this?


Solution

  • IIUC, you can just use read_xml :

    use_cols = ["idInstrument", "shortname", "quantita"]
    
    df = pd.read_xml("file.xml", xpath=".//stockAmount")[use_cols]
    

    Output :

    print(df)
    
       idInstrument shortname quantita
    0           173    ABC.MI   number
    

    Otherwise, use this :

    tree = ET.parse("file.xml")
    root = tree.getroot().findall(".//stockAmount")
    
    use_cols = ["idInstrument", "shortname", "quantita"]
    
    df = pd.DataFrame({key: [e.find(key).text for e in root] for key in use_cols})