php regex forms textarea pcre

Looking for tips to better understand Perl Compatible Regular Expression operators and syntax


My question is about Perl Compatible Regular Expression (PCRE) operators and syntax. I've learned the basic syntax, such as '/hello/', and that the /i modifier makes a match case-insensitive. I looked into this at jotform.com and will keep studying until I understand it better, but I was hoping someone could give me a head start on the syntax and operators in the two PCRE patterns I've posted below. They both work to keep users from posting links in the form textarea, but they are very different in syntax and operators. I'd like to know whether one regex is preferred over the other: which is best, and why?

Update: After several months of live testing, it appears that PCRE 1 does not work to prevent URLs in a PHP contact form, whereas PCRE 2 does seem to work over the same live testing period.

The two regexes below were originally found in "How to prevent spam URLs in a PHP contact form".

Is there a better regex than PCRE 2? Any help or advice would be greatly appreciated.

Thanks.

<?php

//PCRE 1 - Does not work to prevent URLs 

if (preg_match( '/www\.|http:|https:\/\/[a-z0-9_]+([\-\.]{1}[a-z_0-9]+)*\.[_a-z]{2,5}'.'((:[0-9]{1,5})?\/.*)?$/i', $_POST['message']))
{
echo 'error please remove URLs';
}else
{....

//PCRE 2 - Does work to prevent URLs 

if (preg_match("/\b(?:(?:https?|ftp|http):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i",$_POST['message']))
{
echo 'error please remove URLs';
}else
{....

?>

Solution

  • For the sake of offering an answer so that this page can be marked as resolved (instead of abandoned), I'll offer a refinement of the second pattern.

    /\b(?:(?:https?|ftp|http):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i
    

    can be rewritten as:

    \b(?:(?:f|ht)tps?:\/\/)[-\w+&@#\/%?=~|!:,.;]*[-\w+&@#\/%=~|]
    
    • The first segment matches https, http, ftps, or ftp as a "whole word" (\b) using alternation (|) and the zero-or-one quantifier (?). Your original pattern accepts either a protocol followed by :// or a literal www. (and its separate http alternative is redundant, because https? already covers it). The rewrite keeps only the protocol branch, and it additionally matches ftps, which the original did not.
    • The www\. branch in your pattern only recognises protocol-less URLs whose subdomain is www., although a subdomain is not required in a valid URL and there are valid values other than www. that can be used. I have dropped that literal from the rewrite, so once the protocol has matched, any subdomain (or none) is accepted.
    • With the protocol required, the character class (whitelisted characters) already covers the characters in www., so the literal match can be omitted from the pattern.
    • I have reduced the length of both of your character classes by employing \w -- it includes all alphanumeric characters (uppercase and lowercase) as well as the underscore.
    • Here is a demonstration of what is matched: https://regex101.com/r/TP16iB/1 -- you will find that a valid URL like www.example.com is not matched by my pattern, although your preferred pattern does catch it through its www\. branch. To overcome this, you could hardcode www. as an alternative to the protocol (which makes the protocol optional), but then you would still not be matching variable subdomains. So you see, this is a bit of a rabbit hole where you will need to weigh up how much time you wish to invest against what your application really needs. Be warned: the more accurate your pattern becomes, the longer and more convoluted it grows.
      \b(?:(?:(?:f|ht)tps?:\/\/)|(?:www\.))[-\w+&@#\/%?=~|!:,.;]*[-\w+&@#\/%=~|]
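
To make the trade-offs above concrete, here is a minimal sketch that runs the three patterns discussed in this answer against a few sample messages. The array keys, the sample strings, and the containsUrl() helper are made up for illustration. Wrapping the two rewritten patterns in /.../i is my assumption, since they are shown above without delimiters or flags.

```php
<?php
// Hypothetical labels for the three patterns discussed above.
// Single-quoted strings keep the backslashes intact for PCRE.
$patterns = [
    // PCRE 2 from the question, unchanged.
    'original'      => '/\b(?:(?:https?|ftp|http):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i',
    // Rewrite that requires a protocol (assumed /i flag).
    'scheme_only'   => '/\b(?:(?:f|ht)tps?:\/\/)[-\w+&@#\/%?=~|!:,.;]*[-\w+&@#\/%=~|]/i',
    // Rewrite that accepts a protocol or a literal www. (assumed /i flag).
    'scheme_or_www' => '/\b(?:(?:(?:f|ht)tps?:\/\/)|(?:www\.))[-\w+&@#\/%?=~|!:,.;]*[-\w+&@#\/%=~|]/i',
];

// Made-up messages standing in for $_POST['message'].
$samples = [
    'Visit https://example.com/page?id=1 today',
    'Visit www.example.com today',
    'Get ftps://files.example.com/a.zip',
    'Visit example.com today',
    'No links here.',
];

// Hypothetical helper: true when the pattern finds a URL-like substring.
function containsUrl(string $pattern, string $text): bool
{
    return preg_match($pattern, $text) === 1;
}

// Build a pattern-by-sample table of results.
$results = [];
foreach ($patterns as $name => $pattern) {
    foreach ($samples as $text) {
        $results[$name][$text] = containsUrl($pattern, $text);
    }
}

// Print the table so the differences are easy to compare.
foreach ($results as $name => $row) {
    echo $name, PHP_EOL;
    foreach ($row as $text => $hit) {
        echo '  ', ($hit ? 'match   ' : 'no match'), '  ', $text, PHP_EOL;
    }
}
```

In this sketch none of the three patterns flags a bare example.com, only the two rewrites flag the ftps:// link, and only the patterns with a www\. branch flag www.example.com. Which behaviour you want depends on how strict the form needs to be.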