How to parse XML comments in Groovy?


Is there any way to parse XML comments in Groovy?

Neither XmlParser nor XmlSlurper seems to support comment nodes.

Suppose the following file (example.html):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>title</title>
<body>
<table cellpadding="1" cellspacing="1" border="1">
<thead>
<tr><td rowspan="1" colspan="3">title</td></tr>
</thead><tbody>
<!--I cannot be seen-->
<tr>
	<td>x</td>
	<td>x</td>
	<td>x</td>
</tr>
</tbody></table>
</body>
</html>

Here is my code:

def parser = new XmlSlurper(false, false)
parser.setFeature("http://apache.org/xml/features/disallow-doctype-decl", false)
parser.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)

def response = parser.parse('example.html')

And when I use

println XmlUtil.serialize(response)

to output the file, the comment is missing.


Solution

  • Since the input is HTML, you can parse it with jsoup:

    @Grab(group='org.jsoup', module='jsoup', version='1.11.3')
    import org.jsoup.Jsoup
    import org.jsoup.nodes.Document
    
    def html = '''<?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>title</title>
    <body>
    <table cellpadding="1" cellspacing="1" border="1">
    <thead>
    <tr><td rowspan="1" colspan="3">title</td></tr>
    </thead><tbody>
    <!--I cannot be seen-->
    <tr>
        <td>x</td>
        <td>x</td>
        <td>x</td>
    </tr>
    </tbody></table>
    </body>
    </html>'''
    
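    Document doc = Jsoup.parse(html)
    // Find the first comment node directly under <tbody> and print its text
    def comment = doc.select('html body table tbody').first()?.childNodes()?.find { it.nodeName() == '#comment' }
    println comment?.getData()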
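
If the input is well-formed XML and adding a dependency is not an option, the JDK's own DOM parser may be enough, since it keeps comment nodes unless told otherwise. The following is only a minimal sketch under a few assumptions: the sample string and the collectComments helper are hypothetical names introduced for illustration, and the DOCTYPE is left out so that no DTD lookup is attempted. For the original file you would likely still need to disable external DTD loading on the factory.

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Node as DomNode   // alias avoids a clash with groovy.util.Node

// Hypothetical sample input (assumption): a cut-down version of the table, without DOCTYPE.
def xml = '''<table>
<tbody>
<!--I cannot be seen-->
<tr><td>x</td></tr>
</tbody>
</table>'''

// Hypothetical helper: recursively collect the text of every comment node below `node`.
List<String> collectComments(DomNode node, List<String> found = []) {
    def children = node.childNodes
    for (int i = 0; i < children.length; i++) {
        def child = children.item(i)
        if (child.nodeType == DomNode.COMMENT_NODE) {
            found << child.nodeValue
        } else {
            collectComments(child, found)
        }
    }
    return found
}

def factory = DocumentBuilderFactory.newInstance()
// Comments are kept by default; set explicitly here only to make the intent visible.
factory.ignoringComments = false
def doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml.getBytes('UTF-8')))

def comments = collectComments(doc.documentElement)
println comments
```

The trade-off is that this works on a org.w3c.dom.Document rather than on a GPathResult, so the usual XmlSlurper navigation syntax is not available on the result.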