java html jsoup html-content-extraction

Java: How do I extract separated text from nested <div> in HTML?


For example:

<div>
    this is first
    <div>
        second
   </div>
</div>

I am working on natural language processing and need to translate a website (without using Google Translate). To do that, I have to extract the two sentences "this is first" and "second" separately, so that I can replace each one with text in another language inside its own div. If I extract the text of the first div, I get "this is first second", and if I use recursion to dig deeper, I only get "second".

Help me out please!

EDIT

Using the ownText() method causes a problem with the following HTML:

<div style="top:+0.2em; font-size:95%;">
    the
    <a href="/wiki/Free_content" title="Free content">
        free
    </a>
    <a href="/wiki/Encyclopedia" title="Encyclopedia">
        encyclopedia
    </a>
    that
    <a href="/wiki/Wikipedia:Introduction" title="Wikipedia:Introduction">              
        anyone can edit
    </a>
    .
</div>

It will print:

the that.

free

encyclopedia

anyone can edit

But it must be:

the

that

.

encyclopedia

anyone can edit


Solution

  • If I extract text for first it will show "this is first second"

    Use ownText() instead of text() and you'll get only the text that the element contains directly, without the text of its child elements.

    Here's an example:

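    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class OwnTextExample {
        public static void main(String[] args) {
            final String html = "<div>\n"
                    + "    this is first\n"
                    + "    <div>\n"
                    + "        second\n"
                    + "   </div>\n"
                    + "</div>";

            // Parse the HTML; in practice the Document can come from anywhere
            Document doc = Jsoup.parse(html);

            // Outer div: the first div found in document order
            Element first = doc.select("div").first();
            String firstText = first.ownText(); // only the text directly inside this element

            // Inner div: the first div that is a direct child of another div
            Element second = doc.select("div > div").first();
            String secondText = second.ownText();

            System.out.println("1st: " + firstText);  // 1st: this is first
            System.out.println("2nd: " + secondText); // 2nd: second
        }
    }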
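
For the case in the question's EDIT, where ownText() joins "the", "that" and "." into a single string, one possible approach is to read each direct text node on its own through jsoup's Element.textNodes(). The following is only a sketch of that idea, not part of the original answer; the class name TextSegments and the method segments are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class TextSegments {

    // Hypothetical helper: returns every non-blank text node as its own
    // string. Elements are visited in document order, and for each element
    // only its direct text nodes are read, so text separated by child
    // elements is not merged.
    public static List<String> segments(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element el : doc.body().select("*")) {
            for (TextNode node : el.textNodes()) {
                if (!node.isBlank()) {
                    // text() collapses whitespace runs; trim() drops the edges
                    result.add(node.text().trim());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<div>the <a href=\"/wiki/Free_content\">free</a> "
                + "<a href=\"/wiki/Encyclopedia\">encyclopedia</a> that "
                + "<a href=\"/wiki/Wikipedia:Introduction\">anyone can edit</a>.</div>";
        for (String segment : segments(html)) {
            System.out.println(segment);
        }
    }
}
```

With this input, the div's own segments ("the", "that", ".") come out as three separate entries, followed by the text of each link. Because each entry corresponds to one TextNode, a translation step could presumably write its result back with TextNode.text(String) on the same node, leaving the surrounding elements and attributes untouched.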