I am using the following code snippet to extract the entire text content from an HTML document, using Jsoup:
String text = doc.body().text();
System.out.println(text);
It works, but unfortunately all the text content ends up on a single line, with no line breaks. If I redirect the output to a text file, the file contains just one inordinately long line.
Question: What is the correct way to extract the entire text content from an HTML document so that, when it is written to a file, the line breaks are preserved where needed?
You can use wholeText() to keep the line breaks in the text:
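// doc.html() re-serializes the page; by default jsoup pretty-prints it, putting block elements on their own lines.
// wholeText() on the re-parsed document then keeps those newlines instead of collapsing them.
Document doc = Jsoup.connect("YourWebPage").get();
String textWithLines = Jsoup.parse(doc.html()).wholeText();
System.out.println(textWithLines);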
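To see the difference between text() and wholeText() without fetching a page, here is a minimal sketch that parses an inline HTML string. The class name WholeTextDemo, its two helper methods, and the sample markup are made up for illustration, and it assumes a jsoup version recent enough to provide Element.wholeText().

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class: compares text() and wholeText() on the same markup.
public class WholeTextDemo {

    // text() normalizes whitespace, so the result comes back as a single line.
    static String singleLine(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().text();
    }

    // wholeText() keeps the whitespace and newlines found in the parsed markup.
    static String withLineBreaks(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().wholeText();
    }

    public static void main(String[] args) {
        // Sample markup (an assumption): two paragraphs separated by a newline.
        String html = "<p>First line</p>\n<p>Second line</p>";
        System.out.println(singleLine(html));
        System.out.println("---");
        System.out.println(withLineBreaks(html));
    }
}
```

Note that wholeText() can only keep newlines that exist in the markup it was given. If the source HTML is all on one line, there is little for it to preserve, which is presumably why the snippet above re-serializes the document with doc.html() before parsing it again.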