
XML parsing in Python (Elia structure)


I would like to parse an XML file like this one:

<?xml version="1.0" encoding="utf-8"?>
<SolarForecastingChartDataForZone xmlns="http://schemas.datacontract.org/2004/07/Elia.PublicationService.DomainInterface.SolarForecasting.v3" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
    <ErrorMessage i:nil="true"/>
    <IntervalInMinutes>15</IntervalInMinutes>
    <SolarForecastingChartDataForZoneItems>
    <SolarForecastingChartDataForZoneItem>
        <DayAheadForecast>-50</DayAheadForecast>
        <DayAheadP10>-50</DayAheadP10>
        <DayAheadP90>-50</DayAheadP90>
        <Forecast>0</Forecast>
        <ForecastP10>0</ForecastP10>
        <ForecastP90>0</ForecastP90>
        <ForecastUpdated>0</ForecastUpdated>
        <IntraDayP10>-50</IntraDayP10>
        <IntraDayP90>-50</IntraDayP90>
        <LoadFactor>0</LoadFactor>
        <RealTime>0</RealTime>
        <StartsOn xmlns:a="http://schemas.datacontract.org/2004/07/System">
            <a:DateTime>2013-09-29T22:00:00Z</a:DateTime>
            <a:OffsetMinutes>0</a:OffsetMinutes>
        </StartsOn>
        <WeekAheadForecast>-50</WeekAheadForecast>
        <WeekAheadP10>-50</WeekAheadP10>
        <WeekAheadP90>-50</WeekAheadP90>
    </SolarForecastingChartDataForZoneItem>
    <SolarForecastingChartDataForZoneItem>
        <DayAheadForecast>-50</DayAheadForecast>
        <DayAheadP10>-50</DayAheadP10>
        <DayAheadP90>-50</DayAheadP90>
        <Forecast>0</Forecast>
        <ForecastP10>0</ForecastP10>
        <ForecastP90>0</ForecastP90>
        <ForecastUpdated>0</ForecastUpdated>
....

I want to extract the values of the <Forecast> and <a:DateTime> elements.

I tried with Beautiful Soup and minidom, for example:

from xml.dom import minidom

xmldoc = minidom.parse('xmlfile')
itemlist = xmldoc.getElementsByTagName('Forecast')
print(len(itemlist))  # number of <Forecast> elements found
for s in itemlist:
    print(s.nodeValue)

But I don't get any values. I assume I'm doing something wrong, but I don't understand what. Could someone help me? Thank you.


Solution

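  • I'm not exactly sure what your desired output is, but I was working with lxml and XPath when I saw this question. Note that the HTML parser used here lowercases the tag names and drops the namespace prefix, which is why the XPath expressions are written as //forecast and //datetime.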

    >>> from lxml import html
    >>> mystring = ''' I cut and pasted your string here '''
    >>> tree = html.fromstring(mystring)
    >>> for forecast in tree.xpath('//forecast'):
    ...     forecast.text_content()
    ...
    '0'
    '0'
    >>> for dtime in tree.xpath('//datetime'):
    ...     dtime.text_content()
    ...
    '2013-09-29T22:00:00Z'
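
If you would rather keep the original tag names and namespaces, here is a minimal sketch using the standard library's xml.etree.ElementTree instead of the HTML parser. It is only one possible approach. SAMPLE is a trimmed, closed-off copy of the document from the question (one item only), and the function name extract_forecasts and the prefixes "e" and "s" are my own labels, not anything defined by the Elia service; only the namespace URIs are taken from the question.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's document: one item, with the tags closed by hand.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<SolarForecastingChartDataForZone xmlns="http://schemas.datacontract.org/2004/07/Elia.PublicationService.DomainInterface.SolarForecasting.v3" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
    <IntervalInMinutes>15</IntervalInMinutes>
    <SolarForecastingChartDataForZoneItems>
    <SolarForecastingChartDataForZoneItem>
        <Forecast>0</Forecast>
        <StartsOn xmlns:a="http://schemas.datacontract.org/2004/07/System">
            <a:DateTime>2013-09-29T22:00:00Z</a:DateTime>
            <a:OffsetMinutes>0</a:OffsetMinutes>
        </StartsOn>
    </SolarForecastingChartDataForZoneItem>
    </SolarForecastingChartDataForZoneItems>
</SolarForecastingChartDataForZone>"""

# The prefixes "e" and "s" are arbitrary local labels (an assumption of this
# sketch); only the URIs have to match the ones declared in the document.
NS = {
    "e": "http://schemas.datacontract.org/2004/07/Elia.PublicationService.DomainInterface.SolarForecasting.v3",
    "s": "http://schemas.datacontract.org/2004/07/System",
}


def extract_forecasts(xml_bytes):
    """Return a list of (start time, forecast) string pairs, one per item."""
    root = ET.fromstring(xml_bytes)
    rows = []
    for item in root.iterfind(".//e:SolarForecastingChartDataForZoneItem", NS):
        forecast = item.findtext("e:Forecast", namespaces=NS)
        starts_on = item.findtext("e:StartsOn/s:DateTime", namespaces=NS)
        rows.append((starts_on, forecast))
    return rows


print(extract_forecasts(SAMPLE))
```

Pairing the two values per item keeps each forecast attached to its own timestamp, which two separate //forecast and //datetime queries do not guarantee if an item is missing one of them.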
    

    To explore the lxml tree a bit more, you can iterate over every element:

    >>> all_elements = list(tree.iter())
    >>> # Skip the first element: it is the root, and its text_content()
    >>> # is just all the text of the document without the tags.
    >>> for each_element in all_elements[1:]:
    ...     each_element.tag, each_element.text_content()
    ...
    
    ('errormessage', '')
    ('intervalinminutes', '15')
    ('solarforecastingchartdataforzoneitems', '\n    \n        -50\n        -50\n        -50\n        0\n        0\n        0\n        0\n        -50\n        -50\n        0\n        0\n        \n            2013-09-29T22:00:00Z\n            0\n        \n        -50\n        -50\n        -50\n    \n    \n        -50\n        -50\n        -50\n        0\n        0\n        0\n        0')
    ('solarforecastingchartdataforzoneitem', '\n        -50\n        -50\n        -50\n        0\n        0\n        0\n        0\n        -50\n        -50\n        0\n        0\n        \n            2013-09-29T22:00:00Z\n            0\n        \n        -50\n        -50\n        -50\n    ')
    ('dayaheadforecast', '-50')
    ('dayaheadp10', '-50')
    ('dayaheadp90', '-50')
    ('forecast', '0')
    ('forecastp10', '0')
    ('forecastp90', '0')
    ('forecastupdated', '0')
    ('intradayp10', '-50')
    .
    .
    .
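
As for why the minidom attempt in the question shows no values: in the DOM, nodeValue of an element node is always None. The text sits in a child text node, so it has to be read from there. Below is a minimal sketch of that fix, assuming each element of interest contains only text. SAMPLE is a cut-down fragment based on the question's document, and the helper name element_texts is hypothetical.

```python
from xml.dom import minidom

# Cut-down fragment based on the question's document (one item only).
SAMPLE = """<SolarForecastingChartDataForZoneItems xmlns="http://schemas.datacontract.org/2004/07/Elia.PublicationService.DomainInterface.SolarForecasting.v3">
    <SolarForecastingChartDataForZoneItem>
        <DayAheadForecast>-50</DayAheadForecast>
        <Forecast>0</Forecast>
        <StartsOn xmlns:a="http://schemas.datacontract.org/2004/07/System">
            <a:DateTime>2013-09-29T22:00:00Z</a:DateTime>
        </StartsOn>
    </SolarForecastingChartDataForZoneItem>
</SolarForecastingChartDataForZoneItems>"""


def element_texts(xml_text, tag_name):
    """Return the text of every element whose tag name equals tag_name."""
    doc = minidom.parseString(xml_text)
    texts = []
    for element in doc.getElementsByTagName(tag_name):
        # element.nodeValue is None for element nodes; the text is held
        # by the first child, which is a Text node.
        first = element.firstChild
        if first is not None and first.nodeType == minidom.Node.TEXT_NODE:
            texts.append(first.nodeValue)
    return texts


print(element_texts(SAMPLE, "Forecast"))
# getElementsByTagName matches the tag name as written, prefix included.
print(element_texts(SAMPLE, "a:DateTime"))
```

Note that 'Forecast' matches only the elements named exactly Forecast, not DayAheadForecast or WeekAheadForecast.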