Extracting the entire text content from an HTML document using Jsoup


I am using the following code snippet to extract the entire text content from an HTML document, using Jsoup:

    String text = doc.body().text();
    System.out.println(text);

This works, but all of the text comes out as a single line with no line breaks. If I redirect the output to a text file, the file contains just one extremely long line.


Question: What is the correct way to extract the entire text content of an HTML document so that, when it is written to a file, it keeps sensible line breaks?


Solution

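  • You can use wholeText() to keep the line breaks in the text. Unlike text(), which normalizes whitespace and so collapses everything onto one line, wholeText() returns the text with its whitespace, including newlines, left as it is in the parsed markup:

    // "YourWebPage" stands for the URL of the page to fetch
    Document doc = Jsoup.connect("YourWebPage").get();
    // doc.html() pretty-prints by default, adding line breaks around block
    // elements; re-parsing that output lets wholeText() keep them
    String textWithLines = Jsoup.parse(doc.html()).wholeText();
    System.out.println(textWithLines);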
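
If the line breaks need to follow the document structure rather than whatever whitespace the markup happens to contain, another option is to walk the node tree and emit the newlines yourself. The following is only a minimal sketch of that idea, not something from the answer above: the class PlainTextExtractor and the method toPlainText are hypothetical names, and the rule it applies (a newline at each <br> and after each block element) is an assumption you may want to adjust.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

// Hypothetical helper class for illustration; it is not part of jsoup.
class PlainTextExtractor {

    // Returns the body text with a newline at each <br> and after each block element.
    static String toPlainText(String html) {
        final StringBuilder out = new StringBuilder();
        Jsoup.parse(html).body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    TextNode textNode = (TextNode) node;
                    // Skip whitespace-only nodes such as indentation between tags.
                    if (!textNode.isBlank()) {
                        // text() normalizes whitespace inside this single node.
                        out.append(textNode.text());
                    }
                } else if (node instanceof Element && "br".equals(((Element) node).tagName())) {
                    out.append('\n');
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Close each block element (p, div, li, ...) with a line break.
                if (node instanceof Element && ((Element) node).isBlock()) {
                    out.append('\n');
                }
            }
        });
        // Drop leading and trailing blank lines.
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("<p>First</p><p>Second<br>line</p>"));
    }
}
```

With this approach nested block elements can produce consecutive blank lines, so you may want to collapse runs of newlines before writing the result to a file.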