java, html, parsing, jsoup

Avoid removal of spaces and newline while parsing HTML using jsoup


I have the following sample code:

String sample = "<html>\n"
        + "<head>\n"
        + "</head>\n"
        + "<body>\n"
        + "This is a sample on              parsing HTML body using jsoup\n"
        + "This is a sample on              parsing HTML body using jsoup\n"
        + "</body>\n"
        + "</html>";

Document doc = Jsoup.parse(sample);
String output = doc.body().text();

I get this output:

This is a sample on parsing HTML body using jsoup This is a sample on parsing HTML body using jsoup

But I want this output:

This is a sample on              parsing HTML body using jsoup
This is a sample on              parsing HTML body using jsoup

How do I parse it so that I get this output? Or is there another way to do this in Java?


Solution

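  • Disable pretty printing in the document's output settings, and call .html() instead of .text(). The .text() method normalizes whitespace, collapsing runs of spaces and newlines into a single space, so it cannot produce the output you want here.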

    Document doc = Jsoup.parse(sample);
    doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
    String output = doc.body().html();
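
For reference, the approach above can be put together as a small runnable sketch. The class name WhitespaceDemo and its two helper methods are hypothetical names chosen for illustration. The second helper uses Element.wholeText(), which is not part of the original answer and is assumed to be available in the jsoup version on the classpath. It may be the better fit when plain text is wanted, because .html() returns markup: entities stay escaped and any child tags are included.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class; the names are illustrative and not from the original answer.
class WhitespaceDemo {

    // Returns the body's inner HTML with pretty printing disabled,
    // so whitespace inside text nodes is emitted as it was parsed.
    static String rawBodyHtml(String html) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
        return doc.body().html();
    }

    // Returns the body's text without whitespace normalization.
    // Assumption: the jsoup version in use provides Element.wholeText().
    static String rawBodyText(String html) {
        return Jsoup.parse(html).body().wholeText();
    }

    public static void main(String[] args) {
        String sample = "<html>\n<head>\n</head>\n<body>\n"
                + "This is a sample on              parsing HTML body using jsoup\n"
                + "This is a sample on              parsing HTML body using jsoup\n"
                + "</body>\n</html>";

        System.out.println(rawBodyHtml(sample));
        System.out.println(rawBodyText(sample));
    }
}
```

Both results may carry leading or trailing newlines taken from the source markup, so trimming the string afterwards can be useful if only the inner lines matter.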