.net regex html heading


I'm trying to extract all of the data out of the heading tags in a Word document that was converted to HTML (via Word).

I have the following regex:

<(?<Class>h[5|6|7|8])>(?<ListIdentifier>.*?)<span style='font:7.0pt "Times New Roman"'>(?:&nbsp;)+.+</span>(?<Text>.*?)(?:</h[5|6|7|8]>)?

and my source text looks as follows:

<h5>(1)<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</span>The Scheme (planning scheme) has been
prepared in accordance with the <i>asdf </i>(the Act)
as a framework for managing development in a way that advances the purpose of
the Act.</h5>

<h5>(2)<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</span>In seeking to achieve this purpose, the planning scheme sets out
the future development in the
planning scheme area over the next 20 years.</h5>

<h5>(3)<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</span>While the planning scheme has been prepared with a 20 year horizon, it
will be reviewed periodically in accordance with the Act to ensure that it
responds appropriately to the changes of the community at Local, Regional and State
levels.</h5>

The regex appears to work; however, it captures everything from the first h5 down to the last one (or down to any later h6, h7 or h8).

I'm not trying to do anything too complex with the data here; I just need a simple extract, so I'd like to stick with regex as opposed to using an HTML parser. It would be fair to say that the headings in my examples are well formed, i.e. an hX is always closed by an hX and not an hY, and headings don't have headings inside them or anything funky like that.

I thought adding the ? to the end of (?:) would make it non-greedy, so that it would only match the first instance rather than as many as it could. Am I missing something about how greediness works?

EDIT:

The regex

<(?<Class>h[5-8])>(?<ListIdentifier>.*?)<span style='font:7.0pt "Times New Roman"'>(?:&nbsp;)+.+?</span>(?<Text>.*?)(?:</h[5-8]>)

also seems to match

<h6>&nbsp;</h6>

<h6>&nbsp;</h6>

<h6>&nbsp;</h6>

<h6>&nbsp;</h6>

<h5>(1)<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</span>Short Title -The planning scheme policy may be cited as PSP No 2. –
Engineering Standards – Road and Drainage Infrastructure.</h5>

So it includes the whole text, whereas I would like it to ignore the h6s that contain only &nbsp;, as they don't have the span within them.


Solution

  • There is a greedy .+ in the middle of the regex that is causing the problem (just before </span>). Change that to .+? and your regex should work correctly.

    Note that your character classes should be [5678] instead of [5|6|7|8] (the OR between characters is implied), and can even be shortened to [5-8].

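    You should also remove the trailing ? from the end: (?:</h[5-8]>)? should be (?:</h[5-8]>). A ? after a group makes the group optional, not lazy; only a ? placed directly after a quantifier (as in .*? or .+?) makes that quantifier lazy. With the closing tag optional, the lazy .*? in the Text group is free to match nothing, so without this change your match will end before it should.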

    edit: The reason that the current regex matches the text in your edit is that the .*? in the ListIdentifier group will match across a </hX> if the span and nbsp are not seen before it. You should be able to fix this by changing that .*? to [^<]*, which won't match any less-than signs, so it will require that the span is present.

    The result:

    <(?<Class>h[5-8])>(?<ListIdentifier>[^<]*)<span style='font:7.0pt "Times New Roman"'>(?:&nbsp;)+.+?</span>(?<Text>.*?)(?:</h[5-8]>)
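
The three stages discussed above can be compared side by side with a small script. This is only a sketch under some assumptions: it uses Python rather than .NET, so the named groups are spelled (?P<Name>...) instead of (?<Name>...), and re.DOTALL stands in for RegexOptions.Singleline, which the question appears to rely on because the headings span several lines. The sample HTML is a shortened, made-up version of the text in the question, and the names original, lazy, final and rows are purely illustrative.

```python
import re

# Shortened stand-in for the Word-generated HTML in the question (assumed sample).
html = """<h6>&nbsp;</h6>

<h5>(1)<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;
</span>First heading text.</h5>

<h5>(2)<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;
</span>Second heading text.</h5>
"""

# Pieces shared by all three patterns, copied from the regexes in the thread.
HEAD = r"<(?P<Class>h[5-8])>"
SPAN = r"""<span style='font:7.0pt "Times New Roman"'>"""

# Original shape: greedy .+ before </span> and an optional closing tag.
original = re.compile(
    HEAD + r"(?P<ListIdentifier>.*?)" + SPAN
    + r"(?:&nbsp;)+.+</span>(?P<Text>.*?)(?:</h[5-8]>)?",
    re.DOTALL,
)

# First fix: lazy .+? and a mandatory closing tag.
lazy = re.compile(
    HEAD + r"(?P<ListIdentifier>.*?)" + SPAN
    + r"(?:&nbsp;)+.+?</span>(?P<Text>.*?)(?:</h[5-8]>)",
    re.DOTALL,
)

# Final pattern: ListIdentifier is [^<]*, so it cannot run across a tag.
final = re.compile(
    HEAD + r"(?P<ListIdentifier>[^<]*)" + SPAN
    + r"(?:&nbsp;)+.+?</span>(?P<Text>.*?)(?:</h[5-8]>)",
    re.DOTALL,
)


def rows(pattern):
    """Return (Class, ListIdentifier, Text) for every match in the sample."""
    return [
        (m.group("Class"), m.group("ListIdentifier"), m.group("Text"))
        for m in pattern.finditer(html)
    ]


for name, pattern in [("original", original), ("lazy", lazy), ("final", final)]:
    print(name)
    for row in rows(pattern):
        print("   ", row)
```

On this sample, the original pattern should produce a single match that starts at the h6, runs to the last </span>, and leaves Text empty. The lazy version should find both headings but still start the first match at the empty h6, and the final pattern should return only the two h5 headings with "(1)" and "(2)" as their list identifiers.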