I'm trying to extract all of the data out of heading tags in a Word document converted to HTML (via Word).
I have the following regex:
<(?<Class>h[5|6|7|8])>(?<ListIdentifier>.*?)<span style='font:7.0pt "Times New Roman"'>(?: )+.+</span>(?<Text>.*?)(?:</h[5|6|7|8]>)?
My source text looks as follows:
<h5>(1)<span style='font:7.0pt "Times New Roman"'>
</span>The Scheme (planning scheme) has been
prepared in accordance with the <i>asdf </i>(the Act)
as a framework for managing development in a way that advances the purpose of
the Act.</h5>
<h5>(2)<span style='font:7.0pt "Times New Roman"'>
</span>In seeking to achieve this purpose, the planning scheme sets out
the future development in the
planning scheme area over the next 20 years.</h5>
<h5>(3)<span style='font:7.0pt "Times New Roman"'>
</span>While the planning scheme has been prepared with a 20 year horizon, it
will be reviewed periodically in accordance with the Act to ensure that it
responds appropriately to the changes of the community at Local, Regional and State
levels.</h5>
The regex appears to work, but it captures everything from the first h5 down to the last one, or to any later h6, h7, or h8.
I'm not trying to do anything too complex with the data; I just need a simple extract, so I'd like to stick with regex rather than use an HTML parser. It is fair to say that in my examples the headings are well formed, i.e. an hX is always closed by an hX and not an hY, and headings don't have headings inside them or anything funky like that.
I thought adding the ? to the end of (?:) would make it non-greedy, so it would only match the first instance and not as many as it could. Am I missing something here about how greediness works?
EDIT:
The regex
<(?<Class>h[5-8])>(?<ListIdentifier>.*?)<span style='font:7.0pt "Times New Roman"'>(?: )+.+?</span>(?<Text>.*?)(?:</h[5-8]>)
also seems to match
<h6> </h6>
<h6> </h6>
<h6> </h6>
<h6> </h6>
<h5>(1)<span style='font:7.0pt "Times New Roman"'>
</span>Short Title -The planning scheme policy may be cited as PSP No 2. –
Engineering Standards – Road and Drainage Infrastructure.</h5>
so it includes the whole text, whereas I would like it to ignore the h6s with the nbsp, as they don't have the span within them.
There is a greedy .+ in the middle of the regex that is causing the problem (just before </span>). Change that to .+? and your regex should work correctly.
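To illustrate the difference, here is a minimal Python sketch. The sample string and the simplified <span> pattern are made up for the demonstration and are not the Word output from the question; the point is only how far .+ and .+? run before </span>.

```python
import re

# Hypothetical, simplified input: two headings on one line.
html = "<h5>(1)<span>x</span>First</h5><h5>(2)<span>y</span>Second</h5>"

# Greedy: .+ runs to the end of the input and backtracks to the LAST </span>.
greedy = re.search(r"<span>.+</span>", html).group(0)

# Lazy: .+? stops at the FIRST </span> it can reach.
lazy = re.search(r"<span>.+?</span>", html).group(0)

print(greedy)  # <span>x</span>First</h5><h5>(2)<span>y</span>
print(lazy)    # <span>x</span>
```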
Note that your character classes should be [5678] instead of [5|6|7|8] (the OR between characters is implied; inside a class, | is just a literal character), and can even be shortened to [5-8].
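A quick way to see this, sketched in Python (the variable names are only for the demonstration): the class with pipes also accepts a literal | after the h, while the range form does not.

```python
import re

with_pipes = re.compile(r"h[5|6|7|8]")
with_range = re.compile(r"h[5-8]")

# "|" is an ordinary member of the first class, so "h|" matches it.
pipe_matches_loose = with_pipes.fullmatch("h|") is not None
pipe_matches_tight = with_range.fullmatch("h|") is not None

# Both forms accept the intended heading levels.
levels_ok = all(with_range.fullmatch(f"h{n}") for n in "5678")

print(pipe_matches_loose)  # True
print(pipe_matches_tight)  # False
print(levels_ok)           # True
```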
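You should also remove the trailing ? from the end: (?:</h[5-8]>)? should be (?:</h[5-8]>). That ? makes the closing tag optional rather than making anything non-greedy, so without this change your match will end before it should: the lazy .*? in the Text group is free to match nothing at all.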
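The effect of the optional closing tag can be reproduced on just the tail of the pattern. This is a sketch in Python, which spells named groups (?P<Text>...) rather than (?<Text>...) as in the question; the input string is a shortened, made-up heading.

```python
import re

# Hypothetical tail of one heading, starting just before </span>.
text = "<h5>(1)</span>The Scheme</h5>"

# Closing tag optional: the lazy Text group settles for zero characters,
# and the optional group then matches nothing.
optional_close = re.search(r"</span>(?P<Text>.*?)(?:</h[5-8]>)?", text)

# Closing tag required: Text must grow until </h5> can follow it.
required_close = re.search(r"</span>(?P<Text>.*?)(?:</h[5-8]>)", text)

print(repr(optional_close.group("Text")))  # ''
print(repr(required_close.group("Text")))  # 'The Scheme'
```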
edit: The reason the current regex matches the text in your edit is that the .*? in the ListIdentifier group will match a </hX> if the span and nbsp are not seen before it. You should be able to fix this by changing that .*? to [^<]*, which won't match any less-than signs, so it requires that the span is present.
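As a rough check of that explanation, the following Python sketch compares the two ListIdentifier variants on a cut-down pattern. The input, the bare <span> tag, and the use of re.DOTALL (so that . can cross line breaks, which the behaviour described in the question implies) are assumptions made for the demonstration.

```python
import re

# Hypothetical input: an empty h6 followed by a real h5 on the next line.
html = "<h6> </h6>\n<h5>(1)<span>x</span>Short Title</h5>"

loose = re.search(
    r"<(?P<Class>h[5-8])>(?P<ListIdentifier>.*?)<span>", html, re.DOTALL
)
strict = re.search(
    r"<(?P<Class>h[5-8])>(?P<ListIdentifier>[^<]*)<span>", html, re.DOTALL
)

# .*? starts at the h6 and walks across </h6> until it finds a <span>.
print(loose.group("Class"), repr(loose.group("ListIdentifier")))
# h6 ' </h6>\n<h5>(1)'

# [^<]* cannot cross a tag, so the h6 is skipped and the h5 is matched.
print(strict.group("Class"), repr(strict.group("ListIdentifier")))
# h5 '(1)'
```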
The result:
<(?<Class>h[5-8])>(?<ListIdentifier>[^<]*)<span style='font:7.0pt "Times New Roman"'>(?: )+.+?</span>(?<Text>.*?)(?:</h[5-8]>)
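For reference, here is one way the final pattern might be exercised in Python. Several things are assumptions rather than facts from the question: Python needs (?P<Name>...) for named groups, re.DOTALL is used so that . can match the line break inside the span, and the sample text uses plain spaces where Word's output presumably had non-breaking spaces.

```python
import re

# The span's opening tag, copied from the question.
SPAN = "<span style='font:7.0pt \"Times New Roman\"'>"

# Hypothetical sample: two empty h6 lines, then two real h5 headings.
sample = "".join([
    "<h6> </h6>\n",
    "<h6> </h6>\n",
    f"<h5>(1){SPAN}  \n</span>Short Title</h5>\n",
    f"<h5>(2){SPAN}  \n</span>Second item</h5>\n",
])

pattern = re.compile(
    r"<(?P<Class>h[5-8])>(?P<ListIdentifier>[^<]*)"
    r"""<span style='font:7.0pt "Times New Roman"'>(?: )+.+?</span>"""
    r"(?P<Text>.*?)(?:</h[5-8]>)",
    re.DOTALL,  # assumption: let . match newlines
)

results = [
    (m.group("Class"), m.group("ListIdentifier"), m.group("Text"))
    for m in pattern.finditer(sample)
]
print(results)
# [('h5', '(1)', 'Short Title'), ('h5', '(2)', 'Second item')]
```

The empty h6 lines produce no matches because [^<]* cannot reach a span from inside them.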