
Extracting an element's content with JSoup while keeping its enclosing tag


I am trying to traverse an HTML body in order to find all the <h1> tags:

Element body = docJSoup.body();
Elements mainCmp = body.select("h1");

Given this fragment of the body:

<h1><span style='mso-bookmark:_Toc283737133'><span
style='mso-spacerun:yes'></span><span style='mso-spacerun:yes'></span><a
name="_Toc35343186"></a><a name="_Toc264704629"></a><span style='mso-bookmark:
_Toc35343186'>3<span style='mso-tab-count:1'></span>Aspetti metodologici</span></span></h1>

I get this as the result:

<span style="mso-bookmark:_Toc283737133"><span style="mso-spacerun:yes"></span><span style="mso-spacerun:yes"></span><a name="_Toc35343186"></a><a name="_Toc264704629"></a><span style="mso-bookmark:
_Toc35343186">3<span style="mso-tab-count:1"></span>Aspetti metodologici</span></span>

However, I would also like to keep the <h1> tag itself in the result. The <h1> tag may carry attributes of its own, so I cannot simply concatenate "<h1>" to the resulting string. Is there a way to keep it using JSoup methods?

Thanks for any insights.


Solution

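  • Call outerHtml() on each selected element. It gives you the node's markup including its own opening and closing tags, along with any attributes on the opening tag. By contrast, html() returns only the inner markup, which matches the output shown in the question.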
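
As a minimal sketch of how outerHtml() could be applied to the selection from the question: the class name HeadingExtractor, the helper headingsWithTags, and the sample markup with a class attribute are made up for illustration and are not from the original post. It assumes the jsoup library is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
public class HeadingExtractor {

    // Returns the markup of every <h1> in the document body,
    // including the <h1> tag itself and whatever attributes it carries.
    static List<String> headingsWithTags(String html) {
        Document doc = Jsoup.parse(html);
        Elements headings = doc.body().select("h1");

        List<String> result = new ArrayList<>();
        for (Element h1 : headings) {
            // outerHtml() includes the element's own tags; html() would not.
            result.add(h1.outerHtml());
        }
        return result;
    }

    public static void main(String[] args) {
        // Simplified sample input (assumption): an <h1> with an attribute.
        String html = "<html><body>"
                + "<h1 class=\"title\"><span>3 Aspetti metodologici</span></h1>"
                + "</body></html>";

        for (String markup : headingsWithTags(html)) {
            System.out.println(markup);
        }
    }
}
```

Because the attributes are serialized as part of the opening tag, nothing needs to be concatenated by hand. If a single string for all matches is enough, calling outerHtml() directly on the Elements collection should also work, returning the combined markup of the matched elements.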