java jsoup pretty-print

Selectively deactivate pretty-print in jsoup for certain regions in a document


When cleaning HTML documents with jsoup, I like the fact that it automatically applies pretty-printing. I know I can deactivate it on a per-document basis, but I would like to keep it for most of the document, with the exception of certain problematic regions in which jsoup does not do a good job.

An example would be DIV tags with CSS specifying white-space: pre-wrap;, i.e. semantically they behave like PRE tags, which means that the browser will be sensitive to line feeds and other whitespace (indentation). It gets worse if inside those regions there are more tags like BR, SPAN etc. because pretty-printing is applied and destroys the intended formatting of those regions.

So instead of deactivating pretty-printing completely for the whole HTML document, I would like to selectively deactivate it whenever the parser meets something like div.listing (yes, I know the CSS class name of the problematic region) and retain the original HTML there. How would I go about implementing this?

Update: I forgot to mention that I print the cleaned document using

output.print(document);

where output is a PrintStream and document is the jsoup Document instance. So if there is a better way to output the document, I am also open to suggestions.
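
For reference, a minimal sketch of the whole-document switch mentioned above. PrintStream.print(Object) prints the object's toString(), which for a jsoup Document is its outerHtml(), so the output settings of the document decide how it is printed. The class name, the serialize helper, and the sample HTML are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class WholeDocumentPrettyPrint {

    // Hypothetical helper: parses the HTML and serializes it with
    // pretty-printing switched on or off for the entire document.
    static String serialize(String html, boolean pretty) {
        Document document = Jsoup.parse(html);
        document.outputSettings().prettyPrint(pretty);
        // Same text that output.print(document) would write.
        return document.outerHtml();
    }

    public static void main(String[] args) {
        // Sample region whose line break and indentation matter.
        String html = "<div class=\"listing\">line 1\n    line 2<br>line 3</div>";
        System.out.println(serialize(html, true));   // whitespace in the div is reformatted
        System.out.println(serialize(html, false));  // whitespace in the div is kept as parsed
    }
}
```

This is all or nothing: with pretty-printing off, the div.listing content survives, but the rest of the document loses its indentation too.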


Solution

  • What I ended up doing, because I got no answer here so far and found no better solution either, is horrible, but it works:

    1. Parse + clean document
    2. Get (via toString()) + preserve pretty-printed HTML content
    3. Deactivate pretty-printing, then get (via toString()) + re-parse content in original formatting into temporary document
    4. Re-parse pretty-printed content from step 2 and assign to original document (overwriting it), then deactivate pretty-printing for the newly assigned document
    5. Replace the content whose formatting pretty-printing would destroy (e.g. elements with CSS white-space: pre-wrap) with the original-formatting elements from the temporary document of step 3, iterating over both documents in parallel with something like
    for (int i = 0; i < elementsToBeFixed.size(); i++) 
      elementsToBeFixed.get(i).replaceWith(originalElements.get(i));
    

    Conclusion: This is ugly because the same document which has already been parsed + cleaned up before needs to be re-parsed two more times, i.e. we have a total of 3 Jsoup.parse(..) calls. But - it works. :-/

    I am still waiting for a better answer here, though.
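
Put together, the five steps above might look roughly like the following sketch. It is not a tested, general solution: the class name, the render method, and the selector parameter are made up for illustration, and it assumes that the matched regions are not nested inside each other and that both documents yield the same elements in the same order.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class SelectivePrettyPrint {

    // Hypothetical helper: returns pretty-printed HTML in which elements
    // matching `selector` (e.g. "div.listing") keep their original whitespace.
    static String render(String html, String selector) {
        // Step 1: parse (any cleaning would also happen here).
        Document doc = Jsoup.parse(html);

        // Step 2: keep the pretty-printed serialization (pretty-printing is on by default).
        String pretty = doc.outerHtml();

        // Step 3: switch pretty-printing off and re-parse the unformatted
        // output into a temporary document.
        doc.outputSettings().prettyPrint(false);
        Document original = Jsoup.parse(doc.outerHtml());

        // Step 4: re-parse the pretty-printed text, then switch pretty-printing
        // off so that it is serialized exactly as it now stands.
        Document result = Jsoup.parse(pretty);
        result.outputSettings().prettyPrint(false);

        // Step 5: swap each reformatted region for its original counterpart.
        Elements toFix = result.select(selector);
        Elements originals = original.select(selector);
        if (toFix.size() != originals.size()) {
            throw new IllegalStateException("Region count differs between documents");
        }
        for (int i = 0; i < toFix.size(); i++) {
            toFix.get(i).replaceWith(originals.get(i));
        }
        return result.outerHtml();
    }

    public static void main(String[] args) {
        String html = "<div class=\"listing\">a\n   b<br>c</div><p>x</p>";
        System.out.print(render(html, "div.listing"));
    }
}
```

The three Jsoup.parse(..) calls mentioned in the conclusion are all visible here. The size check is an added safeguard, so that a mismatch between the two documents fails loudly instead of silently pairing the wrong elements.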