Given the following example code:
bla bla
<div class="a">
<div class="b">beta</div>
bla bla bla
<div class="c">charlie</div>
<b>bold</b>
etc ...
</div>
How do I extract the content of the <div class="a"> tag? Please note that an unknown number of similar tags may be nested inside the parent tag. A simple regex like:
<div class="a">(.*?)</div>
does not work because it will return:
<div class="b">beta
instead of the actual contents of the tag.
The regex should somehow count the number of opening and closing div tags to determine where to stop. I am not sure this is even possible with a regex, hence my question.
Update: My question is not about how to extract a tag's data with a regex in general. It is about how to make sure the entire contents of a tag are extracted, the way an HTML parser would.
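It is not possible to fully parse HTML with a plain regular expression. Matching nested tags requires balanced matching, which only some regex engines offer as an extension (for example, recursive patterns in PCRE or balancing groups in .NET).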
See also: "Using regular expressions to parse HTML: why not?"
With that said, you could parse the HTML yourself or use a library such as jsoup.
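If you go the "parse it yourself" route, the counting idea from the question can be sketched roughly as follows. This is only a minimal illustration, not a real HTML parser: the class name DivExtractor and the method innerHtml are made up for this example, and it assumes the opening tag is given exactly as it appears in the input. It will be fooled by div tags inside comments, scripts, or attribute values, which is exactly why a proper parser is usually the safer choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a general HTML parser.
public class DivExtractor {
    // Matches either an opening <div ...> tag or a closing </div> tag.
    private static final Pattern DIV_TAG =
            Pattern.compile("<div\\b[^>]*>|</div\\s*>", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the inner content of the first element that starts with the
     * literal text openTag, or null if the tag is absent or never closed.
     */
    public static String innerHtml(String html, String openTag) {
        int start = html.indexOf(openTag);
        if (start < 0) {
            return null;
        }
        int contentStart = start + openTag.length();

        // Only scan the text that follows the opening tag.
        Matcher m = DIV_TAG.matcher(html);
        m.region(contentStart, html.length());

        int depth = 1; // we are already inside the outer div
        while (m.find()) {
            if (m.group().startsWith("</")) {
                depth--;
                if (depth == 0) {
                    // This closing tag balances the outer opening tag.
                    return html.substring(contentStart, m.start());
                }
            } else {
                depth++;
            }
        }
        return null; // unbalanced input: the outer div was never closed
    }
}
```

Calling DivExtractor.innerHtml(html, "<div class=\"a\">") on the example from the question should return everything between the outer opening tag and its matching closing tag, nested divs included. The regex is used here only to find individual tags; the nesting itself is tracked by the depth counter, which is the part a plain regex cannot do.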