python html xml elementtree

Parsing XML within HTML using Python


I have an HTML file that contains XML at the bottom, enclosed in a comment. It looks like this:

<!DOCTYPE html>
<html>
<head>
    ***
</head>
<body>
    <div class="panel panel-primary call__report-modal-panel">
        <div class="panel-heading text-center custom-panel-heading">
            <h2>Report</h2>
        </div>
        <div class="panel-body">
            <div class="panel panel-default">
                <div class="panel-heading">
                    <div class="panel-title">Info</div>
                </div>
                <div class="panel-body">
                    <table class="table table-bordered table-page-break-auto table-layout-fixed">
                        <tr>
                            <td class="col-sm-4">ID</td>
                            <td class="col-sm-8">1</td>
                        </tr>

            </table>
        </div>
    </div>
</body>
</html>
<!--<?xml version = "1.0" encoding="Windows-1252" standalone="yes"?>
<ROOTTAG>
  <mytag>
    <headername>BASE</headername>
    <fieldname>NAME</fieldname>
    <val><![CDATA[Testcase]]></val>
  </mytag>
  <mytag>
    <headername>BASE</headername>
    <fieldname>AGE</fieldname>
    <val><![CDATA[5]]></val>
  </mytag>

</ROOTTAG>
-->

The requirement is to parse the XML that sits inside the comment in the HTML above. So far I have read the HTML file into a string and done the following:

with open('my_html.html', 'rb') as file:
    d = str(file.read())
    d2 = d[d.index('<!--') + 4:d.index('-->')]
    d3 = "'''"+d2+"'''"

This returns the XML portion in the string d3, wrapped in three single quotes.

Then I try to parse it with ElementTree:

ET.fromstring(d3)

but it fails with the following error:

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 2

I need some help to basically:

  • Read the HTML
  • Take out the XML snippet that is commented out at the bottom of the HTML
  • Pass that string to the ET.fromstring() function. I assumed this function takes a string with triple quotes, but the string is not being formatted properly, hence the error

Solution

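  • You were already on the right path. Two things break your version: calling str() on the bytes read from a file opened in 'rb' mode gives the b'...' representation with escaped newlines, and the added ''' become literal characters in the string, so the parser no longer sees valid XML. ET.fromstring() takes a plain string; triple quotes are only Python source syntax. I saved your HTML to a file and the following works fine.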

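    import xml.etree.ElementTree as ET
    
    # Open in text mode so read() returns a str rather than bytes.
    with open('extract_xml.html') as handle:
        content = handle.read()
    
    # Slice out everything between the comment markers; no extra quotes are needed.
    xml = content[content.index('<!--') + 4:content.index('-->')]
    document = ET.fromstring(xml)
    
    for element in document.findall("./mytag"):
        for child in element:
            print(child.tag, child.text)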
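
If the real page might contain other comments before the XML one (the `<head>` is elided in the question, so this is only a guess), slicing at the first `<!--` could pick up the wrong block. A slightly more defensive sketch is to search for a comment whose body starts with an XML declaration, and to hand the raw bytes to ElementTree so that the encoding named in the declaration is used. The helper names `extract_embedded_xml` and `fields_as_dict` and the inline `sample` are made up for illustration; in practice the bytes would come from `open('my_html.html', 'rb').read()`.

```python
import re
import xml.etree.ElementTree as ET

# Matches an HTML comment whose body starts with an XML declaration.
XML_COMMENT = re.compile(rb"<!--\s*(<\?xml.*?)-->", re.DOTALL)


def extract_embedded_xml(html_bytes):
    """Return the root Element of the XML found in an HTML comment (hypothetical helper)."""
    match = XML_COMMENT.search(html_bytes)
    if match is None:
        raise ValueError("no commented-out XML block found")
    # Passing bytes lets the parser apply the encoding named in the XML declaration.
    return ET.fromstring(match.group(1))


def fields_as_dict(root):
    """Map each <fieldname> to the text of its sibling <val> (hypothetical helper)."""
    return {tag.findtext("fieldname"): tag.findtext("val")
            for tag in root.findall("mytag")}


# Shortened stand-in for the file in the question.
sample = b"""<html><body><p>Report</p></body></html>
<!--<?xml version = "1.0" encoding="Windows-1252" standalone="yes"?>
<ROOTTAG>
  <mytag>
    <headername>BASE</headername>
    <fieldname>NAME</fieldname>
    <val><![CDATA[Testcase]]></val>
  </mytag>
  <mytag>
    <headername>BASE</headername>
    <fieldname>AGE</fieldname>
    <val><![CDATA[5]]></val>
  </mytag>
</ROOTTAG>
-->
"""

root = extract_embedded_xml(sample)
print(fields_as_dict(root))
```

The CDATA sections need no special handling: ElementTree exposes their content as the ordinary `.text` of the `<val>` element.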