I'm building a small CMS and I'd like to avoid allowing HTML in the content editor. For that reason I want to detect raw URLs in the text as well as support BB-like tags, for better customization.
www.example.com
[link http://www.example.com]Click me[/link]
Unfortunately I'm fairly new to regular expressions and I just can't seem to get this working. I'm running two regular expressions over the string: the first detects raw URLs, the second BB-like URLs. The latter seems to work perfectly fine, but the first one interferes and also converts URLs that are wrapped in tags.
I started off with a piece of code I found here and made some additions.
This is the code for non-tag URLs:
/* don't match URLs preceded by '[link ' */
(?<!\[link\s)
(
/* match all combinations of protocol and www. */
(\bhttps?://www\.|\bhttps?://|(?<!//)\bwww\.)
/* match URL (no changes made here) */
([^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
/* but don't match if followed by [/link] - THIS DOESN'T WORK */
(?!\[/link\])
)
The negative look-behind before the 'www.' is there because '/' isn't a word character, and without it something like [link http://www.example.com]example[/link] would still match after 'http://'.
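To make the word-boundary point concrete, here is a minimal sketch. It assumes Python's re module purely for illustration (the question does not say which language the CMS uses), and the variable names are made up for this example.

```python
import re

sample = "[link http://www.example.com]example[/link]"

# '/' is a non-word character, so \b holds between '/' and 'w'
# and a bare \bwww\. still matches inside the tag.
without_guard = re.search(r"\bwww\.", sample)

# The (?<!//) look-behind rejects a 'www.' that directly follows '//'.
with_guard = re.search(r"(?<!//)\bwww\.", sample)

print(without_guard.start())  # 13
print(with_guard)             # None
```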
The full regex produces the following matches (tested with http://gskinner.com/RegExr/; matches are in bold. I had to add spaces after 'http://' because I'm not allowed to post more URLs):
www.example.com
http:// www.example.com
http:// example.com
[link http://www.example.com]no problem 1[/link]
[link www.example.com]no problem 2[/link]
[link http://www.example.com]http://www.example.com[/link]
I've tried moving the negative look-ahead around and played with the parentheses (pretty aimlessly), without success.
For completeness, here's the tag-matching regex (which seems to work):
(?:\[link\s)(\bhttps?://|\bwww\.|\bhttps?://www\.)([^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))\](.*)(?:\[/link\])
I'm sure someone can spot the error immediately.
Thanks a lot in advance!
I have taken your regex, inserted it into RegExr with the examples you have given, and tried to make it work.
Step by step:
1) The original regex: http://regexr.com?33snj. The reason this regex also matches the [/link] lies in the URL-matching part:
[^\s()<>]+
This will also match the open bracket character '[', so matching does not stop when it reaches the [/link] part. It could be argued that '[' is a valid URI character, but that is only the case under rare conditions (see this Stack Overflow post for more info).
2) I decided to continue with your regex, but added the open bracket char to the negated character list:
[^\s()<>[]+
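This leads to another problem; see http://regexr.com?33snp. Because of backtracking, the engine now finds a way around the negative lookahead at the end: it gives back the last character of the URL, so the lookahead is tested one position earlier, where [/link] does not immediately follow.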
3) Once you make the URL-matching group atomic (by adding ?> at the start of the group, which also turns it into a non-capturing group), the engine no longer backtracks into it, and we arrive at the desired outcome.
(?<!\[link\s)((\bhttps?://www\.|\bhttps?://|(?<!//)\bwww\.)(?>[^\s()<>[]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))(?!\[/link\]))
See it in action: http://regexr.com?33sns.
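If you want to reproduce the three steps outside RegExr, the following is a rough sketch in Python. Several things here are assumptions rather than part of the original regex: Python itself, the names NON_PUNCT, PREFIX, TAIL, SUFFIX and find_raw_urls, and two substitutions forced by Python's re module. It has no POSIX [:punct:] class, so the class is built from string.punctuation, and it only supports (?>...) from version 3.11 on, so the atomic group is emulated with a lookahead plus a backreference, which should behave the same way here. The '[' inside the character class is escaped, and the inner capture group of the URL tail is flattened into a plain alternation.

```python
import re
import string

# Stand-in for the POSIX class [^[:punct:]\s], which Python's re lacks.
NON_PUNCT = "[^" + re.escape(string.punctuation) + r"\s]"

PREFIX = r"(?<!\[link\s)(?:\bhttps?://www\.|\bhttps?://|(?<!//)\bwww\.)"
TAIL = r"(?:\([\w\d]+\)|" + NON_PUNCT + r"|/)"
SUFFIX = r"(?!\[/link\])"

# Step 1: original URL body, '[' is still allowed.
STEP1 = re.compile(PREFIX + r"[^\s()<>]+" + TAIL + SUFFIX)
# Step 2: '[' excluded, but the body can still backtrack.
STEP2 = re.compile(PREFIX + r"[^\s()<>\[]+" + TAIL + SUFFIX)
# Step 3: body made atomic, emulated as (?=(group)) plus a backreference.
STEP3 = re.compile(
    PREFIX + r"(?=(?P<body>[^\s()<>\[]+" + TAIL + r"))(?P=body)" + SUFFIX
)


def find_raw_urls(pattern, text):
    """Return every substring of text that the pattern matches."""
    return [m.group(0) for m in pattern.finditer(text)]


tagged = "[link http://www.example.com]http://www.example.com[/link]"
print(find_raw_urls(STEP1, tagged))  # ['http://www.example.com[/link']
print(find_raw_urls(STEP2, tagged))  # ['http://www.example.co']
print(find_raw_urls(STEP3, tagged))  # []
print(find_raw_urls(STEP3, "see www.example.com or http://example.com"))
# ['www.example.com', 'http://example.com']
```

On an engine with native atomic groups (PCRE, or Python 3.11 and later), the (?=(?P<body>...))(?P=body) part can be written as (?>...) exactly as in step 3.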