How to find the offending line when using XmlSlurper


I am parsing a dirty HTML page with XmlSlurper, and I get the following error:

ERROR org.xml.sax.SAXParseException: Element type "scr" must be followed by either attribute specifications, ">" or "/>".
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        ...
[Fatal Error] :1157:22: Element type "scr" must be followed by either attribute specifications, ">" or "/>".

I print the HTML right before feeding it to the parser. When I open that output and go to the line mentioned in the error, 1157, there is no 'src' there (although the string appears hundreds of times elsewhere in the file). So I guess something extra is inserted (maybe <script> or something like that) that shifts the line numbers.

Is there a good way to find exactly the offending line or HTML fragment?


Solution

  • Which SAX parser are you using? HTML is not strict XML, so using XmlSlurper with the default parser will probably keep producing errors like this one.

    A cursory Google search for "Groovy html slurper" led me to "HTML Scraping With Groovy", which points to a SAX parser called TagSoup.

    Give that a whirl and see if it parses the dirty page.
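
As a rough illustration of the suggestion above, here is a minimal sketch of handing TagSoup to XmlSlurper. It assumes TagSoup 1.2.1 can be fetched from Maven Central via Grape and that Groovy 3 or later is in use; the `dirtyHtml` string is a made-up stand-in for the real page, not the asker's input.

```groovy
// Assumption: TagSoup 1.2.1 is resolvable from Maven Central through Grape.
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import org.ccil.cowan.tagsoup.Parser
// Groovy 3+ location of XmlSlurper. On Groovy 2.x remove this import,
// since groovy.util.XmlSlurper is imported by default there.
import groovy.xml.XmlSlurper

// Hypothetical input: unclosed tags and an unquoted attribute value,
// the kind of markup a strict XML parser rejects.
def dirtyHtml = '<p>unclosed paragraph<br><img src=logo.png alt="Logo">'

// XmlSlurper accepts any SAX XMLReader; TagSoup's Parser is one that
// repairs malformed HTML into a well-formed event stream.
def slurper = new XmlSlurper(new Parser())
def page = slurper.parseText(dirtyHtml)

// Depth-first search for the first <img> element, then read its attribute.
def img = page.'**'.find { it.name() == 'img' }
println "root element: ${page.name()}"
println "img src: ${img.@src}"
```

Because TagSoup repairs the markup instead of rejecting it, this approach sidesteps the SAXParseException rather than locating the line that caused it.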