Search code examples
pythonpython-3.xpetl

Reading XML files with Petl


I'm trying to parse information from an XML file into a table that has already been created from another CSV file with Petl and am having trouble with the syntax of the fromxml() function.

The XML file contains:

    <Locations>
  <qld_location>
    <Suburb>Brisbane-Central</Suburb>
    <Lat>-27.46758</Lat>
    <Long>153.027892</Long>
  </qld_location>
  <qld_location>
    <Suburb>Robertson</Suburb>
    <Lat>-27.565733</Lat>
    <Long>153.057213</Long>
  </qld_location>
  <qld_location>
    <Suburb>Logan-Village</Suburb>
    <Lat>-27.767054</Lat>
    <Long>153.116881</Long>
  </qld_location>
</Locations>

I currently have a table:

import petl as etl

table = (
        etl
            .fromcsv('QLD_Health_Care_Practices.csv')
            .convert('Practice_Name', 'upper')
            .convert('Suburb', str)
            .convert('State', str)
            .convert('Postcode', str)
    )

    +-----------------------------------+--------------------+-------+----------+
    | Practice_Name                     | Suburb             | State | Postcode |
    +===================================+====================+=======+==========+
    | 'BRISBANE CENTRE HEALTH SERVICES' | 'Brisbane-Central' | 'QLD' | '4000'   |
    +-----------------------------------+--------------------+-------+----------+
    | 'ROBERTSON FAMILY PRACTICE'       | 'Robertson'        | 'QLD' | '4109'   |
    +-----------------------------------+--------------------+-------+----------+
    | 'LOGAN VILLAGE CLINIC'            | 'Logan-Village'    | 'QLD' | '4207'   |
    +-----------------------------------+--------------------+-------+----------+
    | 'IPSWICH HEALTH CLINIC'           | 'Ipswich'          | 'QLD' | '4305'   |
    +-----------------------------------+--------------------+-------+----------+
    | 'CATTLE CREEK CLINIC'             | 'Cattle Creek'     | 'QLD' | '4407'   |
    +-----------------------------------+--------------------+-------+----------+

and would like to add the Longitude and Latitude in new columns from the XML file.

I am attempting to use the function:

table1= (etl.fromxml('QLD_Locations.xml', 'Locations', 'qld_location', 'Suburb', 'Lat', 'Long')

but am having troubles understanding the arguments needed for this syntax of XML.

Any help would be much appreciated, thanks in advance.


Solution

  • A "flat" argument list to .fromxml() only works if you also have a flat kind of XML structure, where all the data columns have the same name:

    <ROOT>
      <RECORD>
        <DATA /><DATA /><DATA /><DATA /><DATA />
      </RECORD>
    </ROOT>
    

    This can be extracted as

    etl.fromxml('file.xml', 'RECORD', 'DATA')
    

    To extract data from a more a complex structure with various column names or nesting, like you have:

    <ROOT>
      <RECORD>
        <A /><B /><C /><D /><E />
      </RECORD>
    </ROOT>
    

    Here you must explicitly specify each column you want to extract, using a dict:

    etl.fromxml('file.xml', 'RECORD', {
        'Column 1': 'A',
        'Column 2': 'B',
        'Column 3': 'C',
        'Column 4': 'D',
        'Column 5': 'E',
    })
    

    Where 'A', 'B', 'C' and so on are XPath expressions relative to RECORD.