
Parse HTML to get text for individual elements using Jsoup


I need to parse the text below and create a separate object for each piece of text. I tried a few approaches, but none of them produces the results in the format I need.

The text is:

String text = "This is start of a text&nbsp;<a href=\"https://google.com/sample\">followed by a link&nbsp;sample</a>and ending with some text.";

Using the below code:

Document document = Jsoup.parse(text);
Elements elements = document.select("*");
for (Element e : elements) {
    System.out.println(e.tagName() + ": " + e.text());
}

The actual results are:

root: This is start of a text followed by a link sampleand ending with some text.
html: This is start of a text followed by a link sampleand ending with some text.
head: 
body: This is start of a text followed by a link sampleand ending with some text.
p: This is start of a text followed by a link sampleand ending with some text.
a: followed by a link sample

I need to get the results below so that I can create a custom object for each piece of text:

body: This is start of a text&nbsp;
a:followed by a link&nbsp;sample
body:and ending with some text.

Solution

  • To avoid returning the text of all children, use e.ownText(). That is not enough in this case, though: you want "This is start of a text" and "and ending with some text." as separate pieces, but ownText() returns them joined: "This is start of a text and ending with some text.".
    To get a list of the separate texts, use e.textNodes(). The output for body and a will then be:

    body: [
    This is start of a text&nbsp;, and ending with some text.]
    a: [followed by a link&nbsp;sample]
    

    An additional advantage is that the output keeps the original &nbsp;.
    Also, if you don't want the redundant html: [] and head: [] entries that the HTML parser adds to your document, use the XML parser:

    Document document = Jsoup.parse(text, "", Parser.xmlParser());
    

    To keep both the separated texts and the <a> text in document order, iterate recursively: start with document.childNodes() and then call childNodes() on every node. You can identify text nodes by checking if (node instanceof TextNode).
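The recursive traversal described above could look roughly like the following. This is a minimal sketch, not a definitive implementation: it assumes jsoup is on the classpath, the class names TextSegments and Segment are hypothetical placeholders for your own custom object, and it uses the default HTML parser and starts at document.body() so that top-level text is reported under body.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class TextSegments {

    // Hypothetical holder: one piece of text plus the tag that directly contains it.
    static final class Segment {
        public final String tag;
        public final String text;

        Segment(String tag, String text) {
            this.tag = tag;
            this.text = text;
        }

        @Override
        public String toString() {
            return tag + ": " + text;
        }
    }

    // Parses the HTML fragment and returns its text pieces in document order.
    static List<Segment> extract(String html) {
        Document document = Jsoup.parse(html);
        List<Segment> segments = new ArrayList<>();
        collect(document.body(), segments);
        return segments;
    }

    // Depth-first walk: record each text node, recurse into everything else.
    private static void collect(Node node, List<Segment> out) {
        for (Node child : node.childNodes()) {
            if (child instanceof TextNode) {
                // getWholeText() returns the decoded text without whitespace normalisation.
                out.add(new Segment(node.nodeName(), ((TextNode) child).getWholeText()));
            } else {
                collect(child, out);
            }
        }
    }

    public static void main(String[] args) {
        String text = "This is start of a text&nbsp;<a href=\"https://google.com/sample\">"
                + "followed by a link&nbsp;sample</a>and ending with some text.";
        for (Segment segment : extract(text)) {
            System.out.println(segment);
        }
    }
}
```

Note that getWholeText() returns the decoded text, so each &nbsp; arrives as the non-breaking space character U+00A0 rather than as the literal entity. If you need the entity form instead, the text node's outerHtml() is one place to look, though its exact formatting depends on the document's output settings.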