To extract URLs I use the following (not a perfect solution, but I'm almost satisfied with it since performance counts):
preg_match_all('#\bhttps?://[^,\s()<>]+(?:\([\w\d]+\)|([^,[:punct:]\s]|/))#', $string, $match);
However, it doesn't quite work for me: the URL should be cut off at ]
or |
if either of these two symbols occurs in the extracted URL.
I know they are valid characters in URLs, but in my case they should be treated as invalid.
How should the preg_match_all
call above be modified to treat these two characters as delimiters?
Thank you.
[:punct:]
is shorthand for [!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_``{|}~]
.
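As a quick sanity check of that expansion, here is a minimal sketch that collects every ASCII character matched by [[:punct:]] and compares the result with the explicit list. The variable names are only illustrative, and it assumes the default "C" locale (no setlocale() call) and no /u modifier.

```php
<?php
// The 32 ASCII punctuation characters, in ASCII order.
// Single-quoted string: only \' and \\ are escape sequences here.
$expected = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~';

// Collect every ASCII character that [[:punct:]] matches.
$matched = '';
for ($code = 0; $code < 128; $code++) {
    $char = chr($code);
    if (preg_match('/[[:punct:]]/', $char) === 1) {
        $matched .= $char;
    }
}

// True when the POSIX class and the explicit list agree.
var_dump($matched === $expected);
```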
In your regex you are using a negated character class [^,[:punct:]\s]
that could be written as: [^!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_``{|}~\s]
(I've removed the leading comma because [:punct:] already includes it, and I've doubled the backquote for highlighting).
If you want to allow ]
and |
as the last character of the match, remove them from the character class:
[^!"\#$%&'()*+,\-./:;<=>?@\[\\^_`{}~\s]
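If the goal is instead the one stated in the question, that is, to stop the match at ] or |, one possible approach is to add both characters to the first negated class, [^,\s()<>], so the greedy part can never run past them. The following is only a sketch under that assumption: the variable names and the example.com / example.org URLs are made up for illustration, and it has not been checked against real-world input.

```php
<?php
// Pattern from the question, unchanged.
$original = '#\bhttps?://[^,\s()<>]+(?:\([\w\d]+\)|([^,[:punct:]\s]|/))#';

// Variant: \] and | added to the first negated class, so the greedy
// part stops before either character (assumption: this is the desired cut).
$modified = '#\bhttps?://[^,\s()<>\]|]+(?:\([\w\d]+\)|([^,[:punct:]\s]|/))#';

// Hypothetical input containing both delimiters inside URLs.
$text = 'See http://example.com/page]more and http://example.org/a|b for details.';

preg_match_all($original, $text, $before);
preg_match_all($modified, $text, $match);

print_r($before[0]); // http://example.com/page]more, http://example.org/a|b
print_r($match[0]);  // http://example.com/page, http://example.org/a
```

Inside a character class | is a literal, so only ] needs escaping. The final alternative of the pattern is left untouched, so the last character of each match is still restricted exactly as before.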