Tags: html, regex, tags, negative-lookbehind

Finding a regexp pattern not preceded by something


I have the following HTML file structure:

<table>
   <tr class="heading">
      <td colspan="2">
         <h2 class="groupheader">Public Types</h2> 
         <!-- I don't want that! We're in a table.-->
      </td>
   </tr>
   <tr>...</tr> 
</table>
<h2 class="groupheader">Detailed Description</h2>
  <!-- I want all that until the next h2-->
  <div class="textblock"><p>Provides the functions to control the generation of a single data log file. </p>
    <h4>Example</h4>
    <div class="fragment"><div class="line">Test <a href="aaa">stuff</a>();</div>
        <div class="line">...</div>     
        <div class="line">...</div>
    </div>
</div> <!-- end of first result -->

<h2 class="groupheader">Member</h2>
<!-- I want all that until the next h2 or hr-->
<a class="anchor"></a>
<div class="memitem">
<div class="memproto">
      <table class="memname">
        <tr>
          <td class="memname">enum <a class="el" href="...">test</a></td>
        </tr>
      </table>
</div><div class="memdoc">
<hr><!-- End of 2nd result -->

With a regexp, I need to get all the content between each title and the next title or hr tag, except when the title is inside a table.

So far, my pattern captures everything from an h2 up to the next h2 or hr. It looks like this:

(?s)(<h2 class="groupheader">.*?)(<h2|<hr)

How can I skip the content under an h2 that is contained in a table? I've tried experimenting with a negative lookbehind, but I'm not getting anywhere.

Thank you for the help.


Solution

  • NOTE THAT HTML SHOULD BE PARSED WITH AN APPROPRIATE PARSER

    Now, since we are left with just HTML-looking input, and a task

    to get all the content between each title and the next title or hr tag, except when the title is inside a table

    let me show how it could be done.

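    You can obtain the substrings you need with a tempered greedy token, ((?:(?!<\/table|<h2|<hr)(?:<table\b[^<]*>.*?<\/table>|.))*), followed by a positive lookahead. At each position the token first checks, via the negative lookahead, that none of </table, <h2 or <hr starts there; it then consumes either a complete nested <table>...</table> block or a single character. An h2 that sits inside a table therefore cannot match, because the token stops at the enclosing </table and the final lookahead (?=<h2|<hr) fails, while tables nested inside a wanted section are kept as part of the content: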

    (?s)<h2 class="groupheader">[^<]*<\/h2>\s*((?:(?!<\/table|<h2|<hr)(?:<table\b[^<]*>.*?<\/table>|.))*)(?=<h2|<hr)
    

    See demo.
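    To make the behaviour concrete, here is a minimal Python sketch that runs the pattern against a shortened version of the markup from the question. The names SECTION_RE, extract_sections and SAMPLE are made up for this illustration, and the sample text is trimmed by hand, so treat it as a quick check rather than a robust solution.

```python
import re

# Pattern from the answer; the optional "\/" escapes are written as plain "/".
SECTION_RE = re.compile(
    r'(?s)<h2 class="groupheader">[^<]*</h2>\s*'
    r'((?:(?!</table|<h2|<hr)(?:<table\b[^<]*>.*?</table>|.))*)'
    r'(?=<h2|<hr)'
)


def extract_sections(html):
    """Return the content after each groupheader h2 that is not inside a table."""
    return [body.strip() for body in SECTION_RE.findall(html)]


# Hypothetical, hand-shortened version of the question's markup.
SAMPLE = """<table>
   <tr class="heading">
      <td colspan="2">
         <h2 class="groupheader">Public Types</h2>
      </td>
   </tr>
</table>
<h2 class="groupheader">Detailed Description</h2>
<div class="textblock"><p>Provides the functions.</p></div>
<h2 class="groupheader">Member</h2>
<div class="memproto">
  <table class="memname"><tr><td class="memname">enum test</td></tr></table>
</div>
<hr>
"""

sections = extract_sections(SAMPLE)
for number, body in enumerate(sections, 1):
    print(f"--- section {number} ---")
    print(body)
```

    The "Public Types" heading produces no result because the token runs into the enclosing </table before any <h2 or <hr, whereas the memname table nested inside the "Member" section is swallowed whole and stays in the captured content.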

    Note that you can use h\d+ instead of h2 to support headings of any level.
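    One possible reading of that note, sketched in Python: replace h2 with h\d+ in the opening and closing heading tags and with h\d in both lookaheads. Changing the lookaheads as well is my assumption; it means a section now also ends at any heading tag, including ones without the groupheader class. ANY_HEADING_RE and the sample string are hypothetical names and data for this illustration only.

```python
import re

# Same structure as the answer's pattern, with h2 generalised to any heading level.
ANY_HEADING_RE = re.compile(
    r'(?s)<h\d+ class="groupheader">[^<]*</h\d+>\s*'
    r'((?:(?!</table|<h\d|<hr)(?:<table\b[^<]*>.*?</table>|.))*)'
    r'(?=<h\d|<hr)'
)

# Hypothetical input mixing two heading levels.
html = (
    '<h2 class="groupheader">Overview</h2>\n'
    '<p>intro</p>\n'
    '<h3 class="groupheader">Details</h3>\n'
    '<p>more</p>\n'
    '<hr>\n'
)

parts = [body.strip() for body in ANY_HEADING_RE.findall(html)]
print(parts)  # ['<p>intro</p>', '<p>more</p>']
```

    Since <h\d requires a digit after the h, it does not collide with the <hr alternative.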