When cleaning HTML documents with jsoup I like the fact that it automatically applies pretty-printing. I know I can deactivate it on a per-document basis, but I would like to keep it for most of the document, with the exception of certain problematic regions in which jsoup does not do a good job.
An example would be DIV tags with CSS specifying white-space: pre-wrap, i.e. semantically they behave like PRE tags, which means that the browser will be sensitive to line feeds and other whitespace (indentation). It gets worse if inside those regions there are more tags like BR, SPAN etc. because pretty-printing is applied and destroys the intended formatting of those regions.
So instead of deactivating pretty-printing completely for the whole HTML document, I would like to selectively deactivate it whenever the parser meets something like div.listing (yes, I know the CSS class name of the problematic region) and retain the original HTML there. How would I go about implementing this?
Update: I forgot to mention that I print the cleaned document using output.print(document); where output is a PrintStream and document is the jsoup Document instance. So if there is a better way to output the document, I am also open to suggestions.
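For reference, this is roughly what the per-document switch mentioned above looks like when combined with the PrintStream output. It is only a minimal sketch: the class name PrintWithoutPrettyPrint and the method name print are made up for illustration, and it turns pretty-printing off for the whole document, which is exactly what I would like to avoid.

```java
import java.io.PrintStream;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class, not part of jsoup.
class PrintWithoutPrettyPrint {

    // Prints the document with pretty-printing switched off for the WHOLE document.
    static void print(Document document, PrintStream output) {
        // Per-document setting: affects every element when the document is serialized.
        document.outputSettings().prettyPrint(false);
        // PrintStream.print(Object) calls toString(), which for jsoup nodes is outerHtml().
        output.print(document);
    }

    public static void main(String[] args) {
        String html = "<div class=\"listing\">line 1\n    line 2<br>line 3</div>";
        print(Jsoup.parse(html), System.out);
    }
}
```

With pretty-printing off, the whitespace inside div.listing survives, but so does whatever formatting the rest of the document had.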
What I ended up doing, because I got no answer here so far and found no better solution either, is horrible, but it works:
- [...] (toString()) + preserve pretty-printed HTML content
- [...] (toString()) + re-parse content in original formatting into temporary document
- [...] (white-space: pre-wrap) by preserved original from step 2, iterating over both documents in parallel with something like

    for (int i = 0; i < elementsToBeFixed.size(); i++)
        elementsToBeFixed.get(i).replaceWith(originalElements.get(i));
Conclusion: This is ugly because the same document which has already been parsed + cleaned up before needs to be re-parsed two more times, i.e. we have a total of 3 Jsoup.parse(..) calls. But - it works. :-/
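A minimal sketch of how the three Jsoup.parse(..) calls might fit together. The class name SelectivePrettyPrint, the method name render and the selector parameter are hypothetical, and the exact order of the steps is an assumption: the document is serialized once pretty-printed and once in original formatting, both strings are re-parsed, and the problematic elements in the pretty-printed copy are replaced by their originals. It also assumes that the matched regions are not nested inside each other and that both re-parsed documents yield the same number of matches.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical helper class, not part of jsoup.
class SelectivePrettyPrint {

    // Returns pretty-printed HTML in which elements matching 'selector'
    // keep the whitespace they had in the parsed input.
    static String render(String html, String selector) {
        // Parse #1: the normal parse (any clean-up would happen on this document).
        Document document = Jsoup.parse(html);

        // Serialize the same document twice: pretty-printed and in original formatting.
        document.outputSettings().prettyPrint(true);
        String prettyHtml = document.outerHtml();
        document.outputSettings().prettyPrint(false);
        String originalHtml = document.outerHtml();

        // Parse #2: the pretty-printed text. Pretty-printing is switched off here so
        // that the layout already contained in the text is emitted unchanged later.
        Document result = Jsoup.parse(prettyHtml);
        result.outputSettings().prettyPrint(false);

        // Parse #3: temporary document holding the content in original formatting.
        Document original = Jsoup.parse(originalHtml);

        Elements elementsToBeFixed = result.select(selector);
        Elements originalElements = original.select(selector);
        if (elementsToBeFixed.size() != originalElements.size()) {
            throw new IllegalStateException("documents do not match element by element");
        }

        // Walk both documents in parallel and swap in the untouched originals.
        for (int i = 0; i < elementsToBeFixed.size(); i++) {
            elementsToBeFixed.get(i).replaceWith(originalElements.get(i));
        }
        return result.outerHtml();
    }

    public static void main(String[] args) {
        String html = "<div><p>Hello</p>"
                + "<div class=\"listing\">line 1\n    line 2<br>line 3</div></div>";
        System.out.println(render(html, "div.listing"));
    }
}
```

The result of render(..) is a plain String, so it can be passed to output.print(..) directly instead of printing the Document.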
I am still waiting for a better answer here, though.