Search code examples
javajtidy

JTidy reports "3 errors were found!"... but does not say what they are


I have a large block of programmatically generated HTML. I ran it through Tidy (version r938) with the following Java code:

StringReader inStr = new StringReader(htmlInput);
StringWriter outStr = new StringWriter();
Tidy tidy = new Tidy();
tidy.setXHTML(true);
tidy.parseDOM(inStr, outStr);

I get the following output:

InputStream: Document content looks like HTML 4.01 Transitional
247 warnings, 3 errors were found!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.

Trouble is, Tidy doesn't tell me what 3 errors it found.

I'm fibbing here a little. The output above actually follows a long list of all 247 warnings (mostly trimming out empty div elements). I can suppress those with tidy.setShowWarnings(false); either way, I see no error report, so I can't figure out what I need to fix. 300Kb of HTML is too much for me to eyeball.

I've tried numerous approaches to finding the error. I can't run it through validate.w3.org, sadly, as the HTML file is on a proprietary network. The most informative approach was to open it in IntelliJ IDEA; this revealed a dozen or so duplicate div IDs, which I fixed. Errors still occurred.

I've looked around for other mentions of this problem. While I find plenty of hits on things like "How can I get the error/warning messages out of the parsed HTML using JTidy?", they all appear to be asking for dissimilar things, or assume conditions that simply aren't holding for me. I'm getting warnings just fine, for example; it's the errors I need, and they're not being reported, even if I call setShowErrors(100) or something.

Am I going to have to dive into Tidy's source code and debug it, starting where it reports errors? Or is there something much simpler I could do?


Solution

  • Here's what I ended up doing to track down the errors:

    1. Download JTidy's source. Most people should be able to go straight to the source.
    2. Unzip the source into my dev area. Right on top of my existing source code. This also meant removing the Maven entry for JTidy from my pom.xml. (It also meant beating IntelliJ into submission (re: editing the relevant .iml files and restarting IJ a lot) when it got extremely confused by this.)
    3. Set a breakpoint in Report.error. The first line of org.w3.tidy.Report.error() increments lexer.errors; error() is called from many places in the lexer.
    4. Run my program in debug mode. Expect this to take a little while if the input HTML is large; a 300k file took around 10-15 seconds on my machine to stop on an error that turned out to be at the very end of the file.
    5. Look at the contents of lexbuf. lexbuf is a byte array, so your IDE might not show it as text. It might also be large. You probably want to look at what index the lexer was looking at within lexbuf. If you have to, take that section of the byte array and cross-reference it with an ASCII table to get the text.
    6. Search for that text in your HTML. Assuming it appears only once, there's your error. (In my case, it appeared exactly three times, and sure enough, I had three errors reported.)

    This was much more involved than it probably should have been. I suspect Report.error() was being called inappropriately.

    In my case, error() was called with the constant BAD_CDATA_CONTENT. This constant is used only by Report.warning(). error() doesn't know what to do with it, and just exits silently with no message at all. If I change the call in Lexer.getCDATA() from error() to warning(), I get the exact line and column of my error. (I also get what appears to be reasonably well-formed XHTML, instead of an empty document.)

    I'd submit a ticket to the JTidy project with some suggestions, but SourceForge isn't letting me log in for some reason. So, here:

    • Given that this "error" appears not to doom the document to unparseability, I'll tentatively suggest that that call be made a warning instead. (In my specific case, it was an HTML tag inside a string constant or comment inside a script element; shouldn't have hurt anything. I asked another question about it, just in case.)
    • Report.error() should have a default case that reports an unhandled error code if it gets one.

    Hope this helps anyone else having what I'm guessing is a rather esoteric problem.