I have HTML like this:
<h2 id="17273">bla bla bla 1</h2>
<p>Text i need</p>
<p>Text i need</p>
<p>Text i need</p>
<h2 id="45626">bla bla bla 2</h2>
<p>Text i need</p>
<p>Text i need</p>
<p>Text i need</p>
<h2 id="78519">bla bla bla 3</h2>
<p>Text i need</p>
<p>Text i need</p>
<h2 id="72725">bla bla bla 4</h2>
<p>Text i need</p>
<p>Text i need</p>
I want to extract all the p tags that follow each h2 tag (up to the next h2) and map them to that h2, like this:
[(h2 with id 17273 = all p tags below it), (h2 with id 45626 = all p tags below it)]
To be honest, I don't know how to achieve that. I've tried a few things, such as doc.siblingElements(),
but none of them gave me this grouping.
Since the <h2> and <p> tags are siblings and are not nested in any way, you can use a regex to wrap each <h2> and the <p> tags that follow it in an artificial parent element:
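String x = html // your HTML string
        .replaceAll("</p>\\s*<h2", "</p></parent>\n<h2") // close the previous group before each following <h2> (\\s* also matches when there is no whitespace)
        .replaceAll("<h2", "<parent><h2")                // open a new group at every <h2>
        + "</parent>";                                    // close the last group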
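To make the effect of the two replacements visible, here is a minimal sketch that uses only the standard library and applies them to a two-section sample. The class name WrapSections and the method name wrap are made up for illustration, and the sketch uses \\s* so that it also matches when no whitespace separates </p> from the next <h2>.

```java
class WrapSections {
    // Wraps every <h2> and the elements that follow it (up to the next <h2>)
    // in an artificial <parent> element. Assumes flat markup as in the question.
    static String wrap(String html) {
        return html
                // close the previous group right before each following <h2>
                .replaceAll("</p>\\s*<h2", "</p></parent>\n<h2")
                // open a new group at every <h2>
                .replaceAll("<h2", "<parent><h2")
                // close the last group
                + "</parent>";
    }

    public static void main(String[] args) {
        String html = "<h2 id=\"1\">a</h2>\n<p>x</p>\n<h2 id=\"2\">b</h2>\n<p>y</p>";
        System.out.println(wrap(html));
    }
}
```

Each <h2> now sits inside its own <parent> together with its <p> tags, so a parser can treat every group as one subtree.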
Then using Jsoup is relatively simple:
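import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.parse(x);
Elements parents = doc.getElementsByTag("parent");
for (Element e : parents) {
    // each <parent> holds one <h2>; select it by tag so a <p> that has an id is not picked up
    String id = e.getElementsByTag("h2").attr("id");
    List<String> pList = new ArrayList<>();
    for (Element p : e.getElementsByTag("p")) {
        pList.add(p.text());
    }
    System.out.println("h2 with id " + id + " = " + pList);
}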
The output received (from a test input whose paragraphs were numbered per section, hence the leading 1 to 4):
h2 with id 17273 = [1 Text i need, 1 Text i need, 1 Text i need]
h2 with id 45626 = [2 Text i need, 2 Text i need, 2 Text i need]
h2 with id 78519 = [3 Text i need, 3 Text i need]
h2 with id 72725 = [4 Text i need, 4 Text i need]
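If a lookup structure is more convenient than printed lines, the grouping can also be collected into a map from h2 id to paragraph texts. The sketch below is not the Jsoup approach above: it is a regex-only variant that needs nothing beyond the standard library, and it assumes markup as simple as in the question (attribute-free <p> tags, id as the first attribute of <h2>, no nesting). The names SectionGrouper and group are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SectionGrouper {
    // Matches either an opening <h2 id="..."> (group 1 = id)
    // or a complete <p>...</p> (group 2 = paragraph text).
    private static final Pattern TAG = Pattern.compile(
            "<h2\\s+id=\"([^\"]*)\"[^>]*>|<p>(.*?)</p>", Pattern.DOTALL);

    // Returns the h2 ids in document order, each mapped to the texts
    // of the <p> tags that follow it up to the next <h2>.
    static Map<String, List<String>> group(String html) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        List<String> current = null; // paragraphs of the most recent <h2>
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            if (m.group(1) != null) {
                // a new <h2> starts a new group
                current = new ArrayList<>();
                result.put(m.group(1), current);
            } else if (current != null) {
                // a <p> belongs to the most recent <h2>; paragraphs before the first <h2> are skipped
                current.add(m.group(2).trim());
            }
        }
        return result;
    }
}
```

A LinkedHashMap keeps the ids in the order they appear in the document. For anything less regular than the sample markup, a real HTML parser such as Jsoup remains the safer choice.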