This should be simple, but it's eluding me. There are many good and bad regex methods to match a URL, with or without the protocol, with or without www. The problem I have is this (in JavaScript): if I use a regex to match URLs in a text string, and set it so that it will match just 'domain.com', it also catches the domain of an e-mail address (the part after '@'), which I don't want. A negative lookbehind solves it, but obviously not in JS.
This is my nearest success so far:
/^(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g
but that fails if the match is not at the start of the string. And I'm sure I'm tackling it the wrong way. Is there a simple answer out there anywhere?
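To make the failure concrete, here is a small sketch using the pattern above exactly as written. The sample strings are made up for illustration. Because the pattern is wrapped in `^` and `$`, it can only match when the URL is the entire string:

```javascript
// The pattern from the question, unchanged.
const anchored = /^(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g;

// Illustrative inputs (not from the original post).
const wholeString = 'domain.com'.match(anchored);
const embedded = 'see domain.com for details'.match(anchored);

console.log(wholeString); // the URL is the whole string, so it matches
console.log(embedded);    // null: ^ and $ rule out a match inside other text
```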
EDIT: Revised regex to respond to a few of the comments below (it sticks with 'www' rather than allowing sub-domains):
\b(www\.)?([^@])(\w*\.)(\w{2,3})(\.\w{2,3})?(\/\S*)?$
As mentioned in the comments, however, this still matches the domain after an '@'.
Thanks
After a lot of messing about, this ended up working (with a definite hat tip to @zmo's final comment):
var rx = /\b(www\.)?(\w*@)?([a-zA-Z\-]*\.)(com|org|net|edu|COM|ORG|NET|EDU)(\.au)?(\/\S*)?/g;
var link = txt.match(rx);
if (link !== null) {
    for (var i = 0; i < link.length; i++) {
        if (link[i].indexOf('@') === -1) {
            // no '@' in the match: create link
        } else {
            // '@' present: create mailto
        }
    }
}
I'm aware of the limitations with regard to sub-domains, TLDs, etc. (which @zmo has addressed above; if you need to catch all URLs, I'd suggest you adapt that code), but that was not the main issue in my case. The code in my answer matches URLs anywhere in a text string, with or without 'www.', without also catching the domain of an e-mail address on its own.
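For anyone who wants to try it, here is a self-contained sketch of the same idea. The regex is the one above, unchanged. The `classifyLinks` helper, the `type`/`value` fields and the sample text are hypothetical names added for illustration. The trick is that the optional `(\w*@)?` group swallows the local part of an e-mail address, so the whole address comes back as one match and can be told apart by checking for '@':

```javascript
// Pattern from the answer above, unchanged.
const rx = /\b(www\.)?(\w*@)?([a-zA-Z\-]*\.)(com|org|net|edu|COM|ORG|NET|EDU)(\.au)?(\/\S*)?/g;

// Hypothetical helper: tags each match as a URL or an e-mail address.
function classifyLinks(txt) {
  const matches = txt.match(rx) || []; // match() returns null when nothing is found
  return matches.map(function (m) {
    return { type: m.indexOf('@') === -1 ? 'url' : 'mailto', value: m };
  });
}

// Illustrative input (not from the original post).
const sample = 'Visit example.com or mail bob@example.org today';
const result = classifyLinks(sample);
console.log(result);
```

The same TLD and sub-domain limitations apply here as in the code above.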