I have the following HTML file structure:
<table>
<tr class="heading">
<td colspan="2">
<h2 class="groupheader">Public Types</h2>
<!-- I don't want that! We're in a table.-->
</td>
</tr>
<tr>...</tr>
</table>
<h2 class="groupheader">Detailed Description</h2>
<!-- I want all that until the next h2-->
<div class="textblock"><p>Provides the functions to control the generation of a single data log file. </p>
<h4>Example</h4>
<div class="fragment"><div class="line">Test <a href="aaa">stuff</a>();</div>
<div class="line">...</div>
<div class="line">...</div>
</div>
</div> <!-- end of first result -->
<h2 class="groupheader">Member</h2>
<!-- I want all that until the next h2 or hr-->
<a class="anchor"></a>
<div class="memitem">
<div class="memproto">
<table class="memname">
<tr>
<td class="memname">enum <a class="el" href="...">test</a></td>
</tr>
</table>
</div><div class="memdoc">
<hr><!-- End of 2nd result -->
With a regex, I need to get all the content between each title and the next title or hr tag, except if it's in a table.
So far, I've got all my h2->h2|hr content. It goes like:
(?s)(<h2 class="groupheader">.*?)(<h2|<hr)
How can I skip the content under an h2 that is contained in a table? I've tried noodling with a negative lookbehind, but I'm not getting anywhere.
Thank you for the help.
NOTE THAT HTML SHOULD BE PARSED WITH AN APPROPRIATE PARSER
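For comparison, here is a minimal sketch of the parser route, assuming Python and its standard-library `html.parser`. The class name `SectionCollector`, its attributes, and the shortened `SAMPLE` input are hypothetical and only illustrate the idea; comments are dropped and character references are decoded, so the output is not byte-for-byte the original markup.

```python
from html.parser import HTMLParser


class SectionCollector(HTMLParser):
    """Collect the markup after each <h2 class="groupheader"> outside tables."""

    def __init__(self):
        super().__init__()
        self.sections = []     # finished sections as (title, markup) pairs
        self.table_depth = 0   # > 0 while the parser is inside a <table>
        self.in_title = False  # True between <h2 ...> and </h2>
        self.title = None      # title of the section being captured
        self.parts = None      # markup pieces of the open section, or None

    def _close_section(self):
        if self.parts is not None:
            self.sections.append((self.title, "".join(self.parts).strip()))
            self.parts = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        # Only headings and rules outside any table delimit a section.
        if self.table_depth == 0 and tag in ("h2", "hr"):
            self._close_section()
            if tag == "h2" and ("class", "groupheader") in attrs:
                self.in_title = True
                self.title = ""
            return
        if self.parts is not None:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <hr/> have no separate end tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if self.in_title and tag == "h2":
            self.in_title = False
            self.parts = []    # start capturing after the title
            return
        if self.parts is not None:
            self.parts.append(f"</{tag}>")
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif self.parts is not None:
            self.parts.append(data)


# Shortened, hypothetical sample modelled on the question's HTML.
SAMPLE = """<table>
<tr class="heading"><td colspan="2">
<h2 class="groupheader">Public Types</h2>
<!-- inside a table: must be skipped -->
</td></tr>
</table>
<h2 class="groupheader">Detailed Description</h2>
<div class="textblock"><p>Some text.</p></div>
<h2 class="groupheader">Member</h2>
<div class="memitem"><table class="memname"><tr><td>enum test</td></tr></table></div>
<hr>
"""

collector = SectionCollector()
collector.feed(SAMPLE)
collector.close()
for title, markup in collector.sections:
    print(title, "->", markup)
```

Like the regex below, this sketch only emits a section once a following h2 or hr has been seen, so a trailing section without a terminator is ignored.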
Now, since we are left with just HTML-looking input and the task "to get all the content between each title until the next title or hr tag, except if it's in a table", let me show how it could be done.
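You can obtain the substrings you need with the help of a tempered greedy token, ((?:(?!<\/table|<h2|<hr)(?:<table\b[^<]*>.*?<\/table>|.))*).
At each position it first checks that none of the alternatives in the negative lookahead (<\/table, <h2 or <hr) starts there, and then consumes either a <table>...</table> block (up to its first closing tag) or a single character. A heading that sits inside a table therefore runs into <\/table before any <h2 or <hr and is rejected, while tables nested inside a section are still matched as part of it.
A positive lookahead at the end requires the next <h2 or <hr without consuming it: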
(?s)<h2 class="groupheader">[^<]*<\/h2>\s*((?:(?!<\/table|<h2|<hr)(?:<table\b[^<]*>.*?<\/table>|.))*)(?=<h2|<hr)
See demo.
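As an illustration, here is a small sketch of the pattern above in Python. The choice of Python, the names `PATTERN`, `SAMPLE` and `sections`, and the shortened sample input are assumptions made for this example; the inline (?s) flag is expressed as `re.DOTALL`, and the `\/` escapes are dropped because `/` needs no escaping in a Python pattern.

```python
import re

# The answer's pattern, split over several lines for readability.
PATTERN = re.compile(
    r'<h2 class="groupheader">[^<]*</h2>\s*'                      # the title
    r'((?:(?!</table|<h2|<hr)(?:<table\b[^<]*>.*?</table>|.))*)'  # the section
    r'(?=<h2|<hr)',                                               # terminator
    re.DOTALL,
)

# Shortened, hypothetical sample modelled on the question's HTML.
SAMPLE = """<table>
<tr class="heading"><td colspan="2">
<h2 class="groupheader">Public Types</h2>
<!-- inside a table: must be skipped -->
</td></tr>
</table>
<h2 class="groupheader">Detailed Description</h2>
<div class="textblock"><p>Some text.</p></div>
<h2 class="groupheader">Member</h2>
<div class="memitem"><table class="memname"><tr><td>enum test</td></tr></table></div>
<hr>
"""

# Group 1 holds the content between a title and the next <h2 or <hr.
sections = [m.group(1) for m in PATTERN.finditer(SAMPLE)]
for section in sections:
    print(repr(section))
```

The "Public Types" heading yields no match because its content reaches </table before any <h2 or <hr, so the final lookahead fails there.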
Note that instead of h2 you can use h\d+ to support any heading level.
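A possible reading of that variant, again sketched in Python: here every h2 in the pattern is replaced with h\d+, including the ones in both lookaheads, which is an assumption on my part. With this reading a section also ends at any nested heading such as an h4. The names `ANY_HEADING`, `HEADINGS_SAMPLE` and `heading_sections` are hypothetical.

```python
import re

# Same structure as the h2 pattern, but accepting any heading level.
ANY_HEADING = re.compile(
    r'<h\d+ class="groupheader">[^<]*</h\d+>\s*'
    r'((?:(?!</table|<h\d+|<hr)(?:<table\b[^<]*>.*?</table>|.))*)'
    r'(?=<h\d+|<hr)',
    re.DOTALL,
)

# Hypothetical input mixing two heading levels.
HEADINGS_SAMPLE = """<h3 class="groupheader">Overview</h3>
<p>Intro.</p>
<h2 class="groupheader">Details</h2>
<p>More.</p>
<hr>
"""

heading_sections = [m.group(1) for m in ANY_HEADING.finditer(HEADINGS_SAMPLE)]
for section in heading_sections:
    print(repr(section))
```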