My question is about Perl Compatible Regular Expression (PCRE) operators and syntax. I've learned the basic syntax, such as '/hello/', and that /i makes a pattern case insensitive. I looked into this at jotform.com and will keep studying until I have a better understanding, but I was hoping someone could give me a head start on the syntax and operators in the two PCRE patterns I've posted below. They both work to keep users from posting links in the form textarea, but they are very different in syntax and operators. I just want to know whether one regex is preferred over the other. Which is best, and why?
Update: After several months of live testing, it appears that PCRE 1 does not work to prevent URLs in a PHP contact form. PCRE 2 does seem to work to prevent URLs in a PHP contact form over the same live testing period.
The two regexes below were originally found in the question "How to prevent spam URLs in a PHP contact form".
Is there a better regex than PCRE 2? Any help or advice would be greatly appreciated.
Thanks.
<?php
//PCRE 1 - Does not work to prevent URLs
if (preg_match( '/www\.|http:|https:\/\/[a-z0-9_]+([\-\.]{1}[a-z_0-9]+)*\.[_a-z]{2,5}'.'((:[0-9]{1,5})?\/.*)?$/i', $_POST['message']))
{
echo 'error please remove URLs';
}else
{....
//PCRE 2 - Does work to prevent URLs
if (preg_match("/\b(?:(?:https?|ftp|http):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i",$_POST['message']))
{
echo 'error please remove URLs';
}else
{....
?>
For the sake of offering an answer so that this page can be marked as resolved (instead of abandoned), I'll offer a refinement of the second pattern.
/\b(?:(?:https?|ftp|http):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i
can be rewritten as:
\b(?:(?:f|ht)tps?:\/\/)[-\w+&@#\/%?=~|!:,.;]*[-\w+&@#\/%=~|]
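A quick way to see how the two patterns differ is to run both against a few inputs. This is only a minimal sketch: the "/" delimiters and the "i" flag on the shortened pattern are my additions so that it can be passed to preg_match(), and the helper name matchesUrl and the sample strings are hypothetical.

```php
<?php
// PCRE 2 exactly as posted in the question.
$original = '/\b(?:(?:https?|ftp|http):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i';

// The shortened pattern. Delimiters and the "i" flag are assumptions added
// here; the pattern body is unchanged.
$rewritten = '/\b(?:(?:f|ht)tps?:\/\/)[-\w+&@#\/%?=~|!:,.;]*[-\w+&@#\/%=~|]/i';

// Hypothetical helper: true when the pattern finds a URL-like substring.
function matchesUrl(string $pattern, string $text): bool
{
    return preg_match($pattern, $text) === 1;
}

// Hypothetical sample inputs.
$samples = [
    'Visit http://example.com today',
    'ftps://files.example.org/a.zip',
    'www.example.com',
    'No links here, just text.',
];

foreach ($samples as $text) {
    printf(
        "%-32s original: %-3s rewritten: %s\n",
        $text,
        matchesUrl($original, $text) ? 'yes' : 'no',
        matchesUrl($rewritten, $text) ? 'yes' : 'no'
    );
}
```

With these inputs, both patterns catch the http link and ignore the plain sentence. Only the shortened pattern catches the ftps link, and only the original catches the bare www.example.com, because the shortened pattern has no www\. branch.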
- The first segment matches https, http, ftps, or ftp as a "whole word" (\b) using alternation (|) and the zero or one quantifier (?). The protocol branch of your original pattern requires the "protocol" portion of the URL to exist, so I will not change that logic.
- The other branch of your original pattern requires a literal www. although the subdomain is not required in a valid URL and there are valid values other than www. that can be used. I am going to change the pattern logic on this segment to make the subdomain optional and more flexible.
- Once a protocol has matched, the character classes that follow already match www., so the literal match can be omitted from the pattern.
- The a-z, 0-9, and _ entries in the character classes are replaced with \w, which includes all alphanumeric characters (uppercase and lowercase) as well as the underscore.
- A URL written without a protocol, such as www.example.com, is matched by your preferred pattern through its www\. branch, but it is not matched by my pattern. To overcome this, you could hardcode www. as an alternative to the protocol (as in the pattern below), but then you would still not be matching variable subdomains when the protocol is missing. So you see, this is a bit of a rabbit hole where you will need to weigh up how much time you wish to invest versus what your application really needs. Be warned: the more accurate your pattern becomes, the longer and more convoluted it grows.
\b(?:(?:(?:f|ht)tps?:\/\/)|(?:www\.))[-\w+&@#\/%?=~|!:,.;]*[-\w+&@#\/%=~|]
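To tie this back to the original question of which pattern is preferable, here is a rough sketch that runs PCRE 1, PCRE 2, and the last pattern above against a few messages. The helper name messageContainsUrl, the sample messages, and the delimiters and "i" flag on the last pattern are all assumptions of mine. It also hints at one likely reason PCRE 1 let links through: its top-level alternation splits it into three branches (www\., http:, and everything from https: up to $), so an https link is only caught when the rest of that branch can reach the end of the message.

```php
<?php
// The three patterns under discussion. PCRE 1 and PCRE 2 are copied from the
// question (the two concatenated halves of PCRE 1 are joined into one string).
// The delimiters and "i" flag on the last pattern are assumptions.
$patterns = [
    'PCRE 1' => '/www\.|http:|https:\/\/[a-z0-9_]+([\-\.]{1}[a-z_0-9]+)*\.[_a-z]{2,5}((:[0-9]{1,5})?\/.*)?$/i',
    'PCRE 2' => '/\b(?:(?:https?|ftp|http):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i',
    'final'  => '/\b(?:(?:(?:f|ht)tps?:\/\/)|(?:www\.))[-\w+&@#\/%?=~|!:,.;]*[-\w+&@#\/%=~|]/i',
];

// Hypothetical helper: true when the pattern finds a URL-like substring.
function messageContainsUrl(string $pattern, string $message): bool
{
    return preg_match($pattern, $message) === 1;
}

// Hypothetical sample messages, standing in for $_POST['message'].
$messages = [
    'Check https://example.com for details',
    'Check http://example.com for details',
    'Visit www.example.com',
    'ftps://files.example.org/a.zip',
    'Just a normal message.',
];

foreach ($messages as $message) {
    echo $message, "\n";
    foreach ($patterns as $name => $pattern) {
        echo '  ', $name, ': ',
            messageContainsUrl($pattern, $message) ? 'blocked' : 'allowed', "\n";
    }
}
```

In this sketch PCRE 1 allows the first message, because text follows the https link and the $ anchor cannot be satisfied, while PCRE 2 and the last pattern block it. Only the last pattern blocks the ftps link. None of the three blocks a bare domain such as example.com, which is the rabbit hole mentioned above.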