Finding nested XML tags with regex

This is my first post here, so I'm hoping to get some responses. I've read through a few similar posts, and the consensus is not to try parsing XML/HTML with regex, but what I'm asking seems easier than the cases in those posts, so I'm giving it a shot.

I'm trying to find all the nested tags. Here is an example of what I want to catch: <a><a></a></a>

I don't want to catch <a></a><a></a>

So, in plain English, I want to catch every <a> that follows another <a> without a </a> in between them. I also want to search through the entire string, so matching should continue even when it hits a newline or line break.
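For comparison, the rule above can be sketched without any regex recursion: scan the tags in order, keep a depth counter, and report every opening tag seen while another one is still open. This is only a rough illustration, not a real XML parser. The function name `find_nested` and the simplified tag pattern are assumptions made for this sketch, and it reads the rule as "nested inside an open <a>".

```python
import re

def find_nested(text, tag="a"):
    """Return start offsets of every <tag> opened while another <tag> is still open."""
    # Matches <a>, <a ...> and </a>; group 1 is "/" for closing tags.
    # Simplification: comments, CDATA and ">" inside attributes are not handled.
    token = re.compile(r"<(/?)" + re.escape(tag) + r"(?:\s[^>]*)?>")
    depth = 0
    nested = []
    for m in token.finditer(text):
        if m.group(0).endswith("/>"):
            continue                     # self-closing tag: nothing can nest in it
        if m.group(1):                   # closing tag
            depth = max(depth - 1, 0)
        else:                            # opening tag
            if depth > 0:
                nested.append(m.start())
            depth += 1
    return nested

print(find_nested("<a><a></a></a>"))      # → [3]
print(find_nested("<a></a><a></a>"))      # → []
print(find_nested("<a>\n<a>\n</a></a>"))  # → [4]
```

Because the scanner never uses `.`, newlines between tags need no special flag.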

Hoping to have this problem solved. Thanks all!

Solution

I hope you are ready for parsing XML with regex.

First of all, let's define what XML tags look like:

<tag_name␣(optional space (then anything that doesn't end with "/"))>(whatever)</␣(optional space)tag_name>
<tag_name␣(optional space)/>

To match one of these tags we can then use the following regex:

/<[^ \/>]++ ?\/>|<([^ \>]++) ?[^>]*+>.*?<\/ ?\1>/s
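A quick way to try the single-tag pattern is Python's standard `re` module. This is a minimal sketch, not the engine the answer has in mind. The `/…/s` flag becomes `re.DOTALL`. Possessive quantifiers (`++`, `*+`) are only accepted by `re` from Python 3.11 on, so the hypothetical helper `compile_pcre_like` downgrades them to greedy ones on older versions. That fallback allows extra backtracking and may change results on malformed input.

```python
import re
import sys

# The single-tag pattern from above, without the /.../s delimiters.
SINGLE_TAG = r"<[^ \/>]++ ?\/>|<([^ \>]++) ?[^>]*+>.*?<\/ ?\1>"

def compile_pcre_like(pattern):
    """Hypothetical helper: compile a PCRE-style pattern with the stdlib re module."""
    if sys.version_info < (3, 11):
        # Older re has no possessive quantifiers; fall back to greedy ones.
        pattern = pattern.replace("++", "+").replace("*+", "*")
    return re.compile(pattern, re.DOTALL)  # DOTALL plays the role of /s

single_tag = compile_pcre_like(SINGLE_TAG)

def first_tag(text):
    """Return the first complete tag found in text, or None."""
    m = single_tag.search(text)
    return m.group(0) if m else None

print(first_tag("<br/>"))                       # → <br/>
print(first_tag('<b class="x">hi\nthere</b>'))  # → the whole element, newline included
print(first_tag("<a>text</b>"))                 # → None (closing name differs)
```

The backreference `\1` is what ties the closing tag to the opening name, which is why the last input produces no match.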

Obviously, nothing can nest inside the second type of tag (the self-closing one), so only the first type needs extending. A regex for two levels of nesting would then be:

/<([^ \>]++) ?[^>]*+>.*?(?:<([^ \>]++) ?[^>]*+>.*?<\/ ?\2>|<[^ \/>]++ ?\/>).*?<\/ ?\1>/s
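The two-level pattern can be checked against the two inputs from the question in the same way. Again, this is a sketch under assumptions. It uses Python's `re` with `re.DOTALL` standing in for `/s`. The helper `compile_pcre_like` and the wrapper `has_nested` are hypothetical names. On Python older than 3.11 the possessive quantifiers are replaced by greedy ones, which may behave differently on malformed input.

```python
import re
import sys

# The two-level pattern from above, split over three lines for readability.
TWO_LEVEL = (r"<([^ \>]++) ?[^>]*+>.*?"
             r"(?:<([^ \>]++) ?[^>]*+>.*?<\/ ?\2>|<[^ \/>]++ ?\/>)"
             r".*?<\/ ?\1>")

def compile_pcre_like(pattern):
    """Hypothetical helper: compile a PCRE-style pattern with the stdlib re module."""
    if sys.version_info < (3, 11):
        # Older re has no possessive quantifiers; fall back to greedy ones.
        pattern = pattern.replace("++", "+").replace("*+", "*")
    return re.compile(pattern, re.DOTALL)  # DOTALL lets "." cross newlines

two_level = compile_pcre_like(TWO_LEVEL)

def has_nested(text):
    """True if some element in text contains at least one other element."""
    return two_level.search(text) is not None

print(has_nested("<a><a></a></a>"))     # → True
print(has_nested("<a></a><a></a>"))     # → False
print(has_nested("<a>\n  <b/>\n</a>"))  # → True (self-closing child, across newlines)
```

Note that the pattern is not specific to `<a>`: any element containing any other element counts, so `<a><b>x</b></a>` matches too.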

Now let's apply some recursion magic, where (?2) re-runs the second capture group inside itself (hopefully your regex engine supports recursion, as PCRE and Perl do, and hasn't crashed yet):

/<([^ \>]++) ?[^>]*+>(.*?(?:<([^ \>]++) ?[^>]*+>(?:[^<]*+|(?2))<\/ ?\3>|<[^ \/>]++ ?\/>).*?)<\/ ?\1>/s

Done. That regex should do it.

No seriously, try it out.

I stole an XML file fragment from the w3schools XML tutorial and tried my regex on it, then copied a Maven project .xml from aliteralmind's question and tried it on that as well. It works best with heavily nested elements.

(Image source: gyazo.com)

Cheers.