I'm trying to parse information from an XML file into a table that has already been created from another CSV file with Petl and am having trouble with the syntax of the fromxml()
function.
The XML file contains:
<Locations>
<qld_location>
<Suburb>Brisbane-Central</Suburb>
<Lat>-27.46758</Lat>
<Long>153.027892</Long>
</qld_location>
<qld_location>
<Suburb>Robertson</Suburb>
<Lat>-27.565733</Lat>
<Long>153.057213</Long>
</qld_location>
<qld_location>
<Suburb>Logan-Village</Suburb>
<Lat>-27.767054</Lat>
<Long>153.116881</Long>
</qld_location>
</Locations>
I currently have a table:
import petl as etl
table = (
etl
.fromcsv('QLD_Health_Care_Practices.csv')
.convert('Practice_Name', 'upper')
.convert('Suburb', str)
.convert('State', str)
.convert('Postcode', str)
)
+-----------------------------------+--------------------+-------+----------+
| Practice_Name | Suburb | State | Postcode |
+===================================+====================+=======+==========+
| 'BRISBANE CENTRE HEALTH SERVICES' | 'Brisbane-Central' | 'QLD' | '4000' |
+-----------------------------------+--------------------+-------+----------+
| 'ROBERTSON FAMILY PRACTICE' | 'Robertson' | 'QLD' | '4109' |
+-----------------------------------+--------------------+-------+----------+
| 'LOGAN VILLAGE CLINIC' | 'Logan-Village' | 'QLD' | '4207' |
+-----------------------------------+--------------------+-------+----------+
| 'IPSWICH HEALTH CLINIC' | 'Ipswich' | 'QLD' | '4305' |
+-----------------------------------+--------------------+-------+----------+
| 'CATTLE CREEK CLINIC' | 'Cattle Creek' | 'QLD' | '4407' |
+-----------------------------------+--------------------+-------+----------+
and would like to add the Longitude and Latitude in new columns from the XML file.
I am attempting to use the function:
table1= (etl.fromxml('QLD_Locations.xml', 'Locations', 'qld_location', 'Suburb', 'Lat', 'Long')
but am having troubles understanding the arguments needed for this syntax of XML.
Any help would be much appreciated, thanks in advance.
A "flat" argument list to .fromxml()
only works if you also have a flat kind of XML structure, where all the data columns have the same name:
<ROOT>
<RECORD>
<DATA /><DATA /><DATA /><DATA /><DATA />
</RECORD>
</ROOT>
This can be extracted as
etl.fromxml('file.xml', 'RECORD', 'DATA')
To extract data from a more a complex structure with various column names or nesting, like you have:
<ROOT>
<RECORD>
<A /><B /><C /><D /><E />
</RECORD>
</ROOT>
Here you must explicitly specify each column you want to extract, using a dict:
etl.fromxml('file.xml', 'RECORD', {
'Column 1': 'A',
'Column 2': 'B',
'Column 3': 'C',
'Column 4': 'D',
'Column 5': 'E',
})
Where 'A', 'B', 'C' and so on are XPath expressions relative to RECORD
.