I have a regular expression that grabs BBcode tags. It works great except for a minor glitch.
Here is the current expression:
\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](.+)\[/\1\]
Here is some text it successfully matches against and the groups it builds:
[url=http://www.google.com]Go to google![/url]
1: url
2: http://www.google.com
3: Go to google!

[img]http://www.somesite.com/someimage.jpg[/img]
1: img
2: NULL
3: http://www.somesite.com/someimage.jpg

[quote][quote]first nested quote[/quote][quote]second nested quote[/quote][/quote]
1: quote
2: NULL
3: [quote]first nested quote[/quote][quote]second nested quote[/quote]
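For anyone who wants to reproduce the listing above, here is a minimal Python sketch. Python is my assumption (the host language is not stated here), and `groups_for` is a hypothetical helper name. The expression is used unchanged; an attribute group that matched nothing comes back as an empty string, so the sketch maps it to `None` to mirror the NULL shown above.

```python
import re

# The original expression from the question, unchanged.
BBCODE = re.compile(r"""\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](.+)\[/\1\]""")

samples = [
    "[url=http://www.google.com]Go to google![/url]",
    "[img]http://www.somesite.com/someimage.jpg[/img]",
    "[quote][quote]first nested quote[/quote][quote]second nested quote[/quote][/quote]",
]


def groups_for(text):
    """Return (tag, value, content) for the first match, or None if no match."""
    m = BBCODE.search(text)
    if m is None:
        return None
    tag, value, content = m.groups()
    # Group 2 is an empty string when the tag has no attribute value.
    return tag, value or None, content


if __name__ == "__main__":
    for sample in samples:
        print(groups_for(sample))
```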
All of this is great. I can handle nested tags by running the 3rd match group against the same regex and recursively handling every nested tag. The problem is with the example using the [quote] tags. Notice that the 3rd match group is a set of two sibling quote tags, so we would expect two matches. However, we get one match, like this:
[quote]first nested quote[/quote][quote]second nested quote[/quote]
1: quote
2: NULL
3: first nested quote[/quote][quote]second nested quote
Ahhhh! That's not what we wanted at all. There is a fairly simple fix. I change the regex from this:
\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](.+)\[/\1\]
To this:
\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](((?!\[/\1\]).)+)\[/\1\]
By adding ((?!\[/\1\]).) we stop the 3rd match group from ever containing the closing BBcode tag, so the match ends at the first closing tag instead of the last one. So now this works, and we get two matches:
[quote]first nested quote[/quote][quote]second nested quote[/quote]

[quote]first nested quote[/quote]
1: quote
2: NULL
3: first nested quote

[quote]second nested quote[/quote]
1: quote
2: NULL
3: second nested quote
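The difference between the two expressions on sibling tags can be checked with a short sketch. Again, Python is an assumption on my part, and `contents` is a hypothetical helper that collects the 3rd match group of every match; both expressions are copied unchanged.

```python
import re

# The greedy and the tempered expression, both unchanged from the text above.
GREEDY = re.compile(
    r"""\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](.+)\[/\1\]"""
)
TEMPERED = re.compile(
    r"""\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](((?!\[/\1\]).)+)\[/\1\]"""
)

siblings = "[quote]first nested quote[/quote][quote]second nested quote[/quote]"


def contents(pattern, text):
    """Return the 3rd match group of every match of pattern in text."""
    return [m.group(3) for m in pattern.finditer(text)]


if __name__ == "__main__":
    print(contents(GREEDY, siblings))    # one match that swallows both tags
    print(contents(TEMPERED, siblings))  # one match per sibling tag
```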
I was happy that fixed it, but now we have another problem. This new regex fails on the earlier example where the two quote tags are nested inside one larger quote tag. We get two matches instead of one:
[quote][quote]first nested quote[/quote][quote]second nested quote[/quote][/quote]

[quote][quote]first nested quote[/quote]
1: quote
2: NULL
3: [quote]first nested quote

[quote]second nested quote[/quote]
1: quote
2: NULL
3: second nested quote
The first match is all wrong and the second match, while well-formed, is not a desired match. We wanted one big match with the 3rd match group being the two nested quote tags, like when we used the first expression.
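A small sketch that reproduces this failure, in case it helps anyone experiment (Python is assumed here; the expression is the modified one, unchanged):

```python
import re

# The modified (tempered) expression from the question, unchanged.
TEMPERED = re.compile(
    r"""\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](((?!\[/\1\]).)+)\[/\1\]"""
)

nested = (
    "[quote][quote]first nested quote[/quote]"
    "[quote]second nested quote[/quote][/quote]"
)

# Each entry is (whole match, 3rd match group).
matches = [(m.group(0), m.group(3)) for m in TEMPERED.finditer(nested)]

if __name__ == "__main__":
    for whole, content in matches:
        print(whole, "->", content)
```

The first match stops at the first closing tag, so its 3rd group holds an unclosed inner [quote]; the trailing [/quote] of the outer tag is left unmatched.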
Any suggestions? If I can just cross this gap I should have a fairly powerful BBcode expression.
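Using balancing groups (a .NET regex feature) you can construct a regex like this. It is written with free spacing, so it needs RegexOptions.IgnorePatternWhitespace: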
(?>
    \[ (?<tag>[^][/=\s]+) \s*
    (?: = \s* (?<val>[^][]*) \s*)?
    ]
)
(?<content>
    (?>
        \[(?<innertag>[^][/=\s]+)[^][]*]
        |
        \[/(?<-innertag>\k<innertag>)]
        |
        [^][]+
    )*
    (?(innertag)(?!))
)
\[/\k<tag>]
Simplified according to Kobi's example.
Given the following input:
[foo=bar]baz[/foo]
[b]foo[/b]
[i][i][foo=bar]baz[/foo]foo[/i][/i]
[i][i][i][i]foo[/i][/i][/i][i][i]foo[/i][/i][/i]
[quote][quote][b][img]foo[/img][b]bold[/b][b][b]deep[/b][/b][/b][/quote]bar[quote]baz[/quote][/quote]
The regex finds these matches:
[foo=bar]baz[/foo]
[b]foo[/b]
[i][i][foo=bar]baz[/foo]foo[/i][/i]
[i][i][i][i]foo[/i][/i][/i][i][i]foo[/i][/i][/i]
[quote][quote][b][img]foo[/img][b]bold[/b][b][b]deep[/b][/b][/b][/quote]bar[quote]baz[/quote][/quote]
Full example at http://ideone.com/uULOs
(Old version http://ideone.com/AXzxW)
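If your regex engine has no balancing groups, the same depth-tracking idea can be approximated with an explicit stack. The sketch below is not the regex above: it is a swapped-in technique in Python (my choice of language), and `TAG`, `SAMPLES`, and `top_level_blocks` are hypothetical names. It pushes each opening tag name, pops on a closing tag that matches the top of the stack, and records a block whenever the stack empties. It agrees with the regex on the balanced samples listed earlier, but it does not reproduce the regex's backtracking on malformed input; a mismatched closing tag simply abandons the current candidate.

```python
import re

# An opening tag such as [b] or [url=...], or a closing tag such as [/b].
TAG = re.compile(r"\[(/?)([^\]\[/=\s]+)[^\]\[]*\]")

SAMPLES = [
    "[foo=bar]baz[/foo]",
    "[b]foo[/b]",
    "[i][i][foo=bar]baz[/foo]foo[/i][/i]",
    "[i][i][i][i]foo[/i][/i][/i][i][i]foo[/i][/i][/i]",
    "[quote][quote][b][img]foo[/img][b]bold[/b][b][b]deep[/b][/b][/b][/quote]"
    "bar[quote]baz[/quote][/quote]",
]


def top_level_blocks(text):
    """Return every outermost, properly balanced [tag]...[/tag] span in text."""
    blocks = []
    stack = []    # names of the tags that are currently open
    start = None  # index where the outermost open tag began
    for m in TAG.finditer(text):
        is_closing = m.group(1) == "/"
        name = m.group(2)
        if not is_closing:
            if not stack:
                start = m.start()
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
            if not stack:
                blocks.append(text[start:m.end()])
        else:
            # Mismatched or stray closing tag: drop the current candidate.
            stack.clear()
    return blocks


if __name__ == "__main__":
    for block in top_level_blocks("\n".join(SAMPLES)):
        print(block)
```

Each returned block can then be processed recursively by stripping its outer tag and calling the function again on the content.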