For example:
<div>
this is first
<div>
second
</div>
</div>
I am working on a natural language processing project and have to translate a website (without using Google Translate). To do that, I need to extract the two sentences "this is first" and "second" separately, so that I can replace each with text in another language in its respective div. If I extract the text of the first div, I get "this is first second", and if I use recursion to dig deeper, I only get "second".
Help me out please!
EDIT
Using the ownText() method causes a problem with the following HTML:
<div style="top:+0.2em; font-size:95%;">
the
<a href="/wiki/Free_content" title="Free content">
free
</a>
<a href="/wiki/Encyclopedia" title="Encyclopedia">
encyclopedia
</a>
that
<a href="/wiki/Wikipedia:Introduction" title="Wikipedia:Introduction">
anyone can edit
</a>
.
</div>
It will print:
the that.
free
encyclopedia
anyone can edit
But it must be:
the
that
.
free
encyclopedia
anyone can edit
You wrote that extracting the text of the first div gives "this is first second".
Use ownText() instead of text() and you'll get only the text that the element contains directly, without the text of its children.
Here's an example:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OwnTextExample {
    public static void main(String[] args) {
        final String html = "<div>\n"
                + " this is first\n"
                + " <div>\n"
                + " second\n"
                + " </div>\n"
                + "</div>";
        Document doc = Jsoup.parse(html); // Get your Document from somewhere
        Element first = doc.select("div").first(); // Outer div: take the first match
        String firstText = first.ownText(); // Only the text directly inside this element
        Element second = doc.select("div > div").first(); // Inner div: a div that is a direct child of a div
        String secondText = second.ownText();
        System.out.println("1st: " + firstText);
        System.out.println("2nd: " + secondText);
    }
}
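Regarding the EDIT: ownText() joins all of an element's direct text nodes into one string, which is why "the", "that" and "." come back merged. One possible way around this is to walk the individual text nodes with Element.textNodes() instead. The following is only a sketch under that assumption; the class name TextNodeExtractor, the helper methods, and the translator function are hypothetical names made up for illustration (String::toUpperCase merely stands in for a real translation step).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

class TextNodeExtractor {

    // Collects every non-blank text node: first the element's own text
    // nodes (each one separately), then those of its children, recursively.
    static void collect(Element element, List<TextNode> out) {
        for (TextNode node : element.textNodes()) {
            if (!node.isBlank()) {
                out.add(node);
            }
        }
        for (Element child : element.children()) {
            collect(child, out);
        }
    }

    // Returns the trimmed text of each collected node as a separate entry.
    static List<String> extract(String html) {
        List<TextNode> nodes = new ArrayList<>();
        collect(Jsoup.parse(html).body(), nodes);
        List<String> texts = new ArrayList<>();
        for (TextNode node : nodes) {
            texts.add(node.text().trim());
        }
        return texts;
    }

    // Replaces the text of each node in place, keeping the whitespace
    // around it. "translator" is a placeholder for your own translation.
    static void translateInPlace(Document doc, UnaryOperator<String> translator) {
        List<TextNode> nodes = new ArrayList<>();
        collect(doc.body(), nodes);
        for (TextNode node : nodes) {
            String whole = node.getWholeText();
            String core = whole.trim();
            int start = whole.indexOf(core);
            node.text(whole.substring(0, start)
                    + translator.apply(core)
                    + whole.substring(start + core.length()));
        }
    }

    public static void main(String[] args) {
        String html = "<div>the <a href=\"/wiki/Free_content\">free</a> "
                + "<a href=\"/wiki/Encyclopedia\">encyclopedia</a> that "
                + "<a href=\"/wiki/Wikipedia:Introduction\">anyone can edit</a>.</div>";
        for (String text : extract(html)) {
            System.out.println(text);
        }

        Document doc = Jsoup.parse(html);
        translateInPlace(doc, String::toUpperCase); // stand-in for a real translator
        System.out.println(doc.body().html());
    }
}
```

With this approach each piece of directly contained text ("the", "that", ".") is handled as its own entry before the text of the links, and because the replacement is written back to the same TextNode, the surrounding markup stays untouched.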