This is my first post here, so I'm hoping to get some responses. I've read through a few similar posts, and the consensus is not to try parsing XML/HTML with regex, but what I'm asking seems easier than the cases in those other posts, so I'm giving it a shot.
I'm trying to find all nested tags. Here are some examples.
I want to catch:
<a><a></a></a>
I don't want to catch:
<a></a><a></a>
So, in plain English, I want to catch every <a> that follows another <a> without a </a> in between them. I also want to search through the entire string, so matching should continue past newlines or line breaks.
Hoping to have this problem solved. Thanks all!
I hope you are ready for parsing XML with regex.
First of all, let's define what XML tags would look like!
<tag_name␣(optional space (then whatever that doesn't end with "/"))>(whatever)</␣(optional space)tag_name>
<tag_name␣(optional space)/>
To match one of these tags we can then use the following regex:
/<[^ \/>]++ ?\/>|<([^ \>]++) ?[^>]*+>.*?<\/ ?\1>/s
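For anyone who wants to try this outside a PCRE engine, here is a rough Python translation, offered as a sketch rather than an exact port. Python's built-in `re` only gained possessive quantifiers in 3.11, so this version swaps `++` for a plain `+` followed by the lookahead `(?=[ >])`, which forces the tag name to run up to the first space or `>`. The names `TAG` and `find_elements` are made up for this illustration.

```python
import re

# Translation of the single-element regex above, assuming Python's built-in `re`.
# The possessive "++" is replaced by "+" plus a lookahead that pins the tag name.
TAG = re.compile(
    r"<[^ />]+ ?/>"                            # self-closing form: <tag_name />
    r"|<([^ >]+)(?=[ >]) ?[^>]*>.*?</ ?\1>",   # paired form: <tag ...>...</tag>
    re.DOTALL,                                 # same role as the /s flag
)

def find_elements(text):
    """Return the text of every top-level match of TAG in `text`."""
    return [m.group(0) for m in TAG.finditer(text)]

if __name__ == "__main__":
    print(find_elements('<br /><p class="x">hi\nthere</p>'))
```

`re.DOTALL` plays the part of the `/s` flag, so `.` also matches newlines.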
Obviously, no tags are going to nest within our second type of XML tag (the self-closing form). So our two-level nested regex would then be:
/<([^ \>]++) ?[^>]*+>.*?(?:<([^ \>]++) ?[^>]*+>.*?<\/ ?\2>|<[^ \/>]++ ?\/>).*?<\/ ?\1>/s
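Here is the two-level pattern run against the two examples from the question, again as a hedged Python sketch and not a drop-in port. It assumes the built-in `re` module, so each possessive `++` is written as `+` with a `(?=[ >])` lookahead. `NESTED` is just a name picked for this example.

```python
import re

# Two-level pattern: an outer element that contains at least one inner element.
NESTED = re.compile(
    r"<([^ >]+)(?=[ >]) ?[^>]*>"                   # outer opening tag, name in group 1
    r".*?"                                         # anything before the inner element
    r"(?:<([^ >]+)(?=[ >]) ?[^>]*>.*?</ ?\2>"      # inner paired element (group 2)
    r"|<[^ />]+ ?/>)"                              # or an inner self-closing tag
    r".*?</ ?\1>",                                 # up to the outer closing tag
    re.DOTALL,                                     # let "." cross newlines, like /s
)

if __name__ == "__main__":
    for sample in ("<a><a></a></a>", "<a></a><a></a>", "<a>\n<a></a>\n</a>"):
        print(repr(sample), bool(NESTED.search(sample)))
```

The nested sample and the multi-line sample match, while the two sibling elements do not.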
Now let's apply some recursion magic (Hopefully your regex engine supports recursion (and doesn't crash yet)):
/<([^ \>]++) ?[^>]*+>(.*?(?:<([^ \>]++) ?[^>]*+>(?:[^<]*+|(?2))<\/ ?\3>|<[^ \/>]++ ?\/>).*?)<\/ ?\1>/s
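If your engine has no recursion at all (Python's built-in `re` is one such case), a different technique can answer the same yes/no question: scan the tags in order and keep a depth counter. This is not the regex above, only a sketch of that alternative. It assumes the input has no comments, no CDATA sections, and no `>` inside attribute values. `TAG_TOKEN` and `has_nested_tags` are hypothetical names.

```python
import re

# Matches one tag token: group 1 is "/" for a closing tag, group 2 is the tag
# name, group 3 is "/" for a self-closing tag.
TAG_TOKEN = re.compile(r"<(/?)\s*([^\s/>]+)[^>]*?(/?)>")

def has_nested_tags(text):
    """Return True if any tag opens while another element is still open."""
    depth = 0
    for m in TAG_TOKEN.finditer(text):
        closing, _name, self_closing = m.groups()
        if closing:
            depth = max(depth - 1, 0)   # tolerate stray closing tags
        elif depth > 0:
            return True                 # opening or self-closing tag inside an element
        elif not self_closing:
            depth += 1                  # a top-level element opens
    return False

if __name__ == "__main__":
    print(has_nested_tags("<a><a></a></a>"))
    print(has_nested_tags("<a></a><a></a>"))
```

The counter ignores tag names, so it only reports whether nesting occurs and does not check that opening and closing tags pair up correctly.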
Done - The regex should do.
No seriously, try it out.
I took an XML fragment from the w3schools XML tutorial and tried it with my regex. I also copied a Maven project .xml from aliteralmind's question and tried that as well. It works best with heavily nested elements.
Cheers.