regex, coldfusion, coldfusion-9

How to capture the actual HTML tag content using regex


Given the following example code:

bla bla 
<div class="a">
    <div class="b">beta</div> 
    bla bla bla 
    <div class="c">charlie</div> 
    <b>bold</b> 
    etc ... 
</div>

How do I extract the content of the <div class="a"> tag? Please note that an unknown number of similar tags may be nested inside the parent tag. A simple regex like:

<div class="a">(.*?)</div> 

does not work because the lazy match stops at the first closing </div>, so it returns:

<div class="b">beta

instead of the actual contents of the tag.

The regex should somehow count the number of opening and closing div tags to determine where to stop. I am not sure this is even possible with regex, hence my question.

Update: My question is not about how to extract a tag's data with regex in general. My question is how to make sure all of the tag's contents are extracted (as an HTML parser would do).


Solution

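  • It is not possible to fully parse HTML with a plain regex. Matching arbitrarily nested tags requires counting, which regular expressions cannot do without engine-specific extensions such as recursion.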

    Using regular expressions to parse HTML: why not?

    That said, you could parse the HTML yourself or use a parser library such as jSoup.

    https://www.bennadel.com/blog/2358-parsing-traversing-and-mutating-html-with-coldfusion-and-jsoup.htm
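
To illustrate the "parse the HTML yourself" option, here is a minimal sketch of the counting idea from the question: a regex is used only to find individual div tags, and an ordinary loop keeps track of nesting depth. It is written in Python purely for illustration (the same logic should port to CFML or Java); the function name extract_div_content and the sample string are hypothetical, not part of any library. It assumes reasonably well-formed markup and will be fooled by things like self-closing divs or "<div" appearing inside comments or scripts, which is why a real parser such as jSoup is the safer choice.

```python
import re

# Matches either an opening <div ...> tag or a closing </div> tag.
DIV_TAG = re.compile(r"<div\b[^>]*>|</div\s*>", re.IGNORECASE)


def extract_div_content(html, opening_tag):
    """Return the inner content of the first element that starts with
    opening_tag, or None if the tag is missing or never closed.

    Hypothetical helper for illustration only, not a full HTML parser.
    """
    start = html.find(opening_tag)
    if start == -1:
        return None
    content_start = start + len(opening_tag)
    depth = 1  # we are already inside the outer <div>
    # Scan every div tag after the opening tag and track nesting depth.
    for match in DIV_TAG.finditer(html, content_start):
        if match.group().startswith("</"):
            depth -= 1
            if depth == 0:
                # This closing tag balances the outer opening tag.
                return html[content_start:match.start()]
        else:
            depth += 1
    return None  # unbalanced markup: the outer tag was never closed


sample = """bla bla
<div class="a">
    <div class="b">beta</div>
    bla bla bla
    <div class="c">charlie</div>
    <b>bold</b>
    etc ...
</div>"""

inner = extract_div_content(sample, '<div class="a">')
print(inner)
```

For the sample from the question, the returned string runs from just after <div class="a"> up to (but not including) the final </div>, so both nested divs are included in full.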