For validating a URL path from user input, I'm using the PHP filter_var function. The input contains only the path (/path/path/script.php).
Before validating the path, I prepend the host. While testing the input validation, I noticed some strange behavior in the URL filter.
Code:
$url = "http://www.domain.nl/http://www.google.nl/modules/authorize/test/normal.php";
var_dump(filter_var($url, FILTER_VALIDATE_URL, FILTER_FLAG_HOST_REQUIRED)); //valid
Can someone explain why this is a valid URL? Thanks!
The short answer is that PHP's FILTER_VALIDATE_URL checks the URL only against RFC 2396, and your URL, although weird, is valid according to that standard.
Long answer:
The filter you are using is documented as compliant with RFC 2396, so let's check that standard.
The regular expression used for parsing a URL and listed there is:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
Where:
scheme = $2
authority = $4
path = $5
query = $7
fragment = $9
As we can see, the ":" character is reserved only in the context of scheme and from that point onwards ":" is fair game (this is supported by the text of the standard). For example, it is used freely in the http: scheme to denote a port. A slash can also appear in any place and nothing prohibits the URL from having a "//" somewhere in the middle. So "http://" in the middle should be valid.
Let's look at your URL and try to match it to this regexp:
$url = "http://www.domain.nl/http://www.google.nl/modules/authorize/test/normal.php";
//Escaped a couple slashes to make things work, still the same regexp
$result_rfc = preg_match('/^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/',$url);
echo '<p>'.$result_rfc.'</p>';
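The test returns 1, so the URL matches. Note that this expression is meant to split a URI into its components rather than to reject bad input, so it is very permissive; here it reads http as the scheme, www.domain.nl as the authority, and everything after that as the path. This is to be expected: as we have seen, the rules do not declare URLs with something like 'http://' in the middle to be invalid. PHP simply mirrors this behaviour with FILTER_VALIDATE_URL.
If you want a more rigorous test, you will need to write the required code yourself. For example, you can prevent "://" from appearing more than once: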
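To see how the pattern assigns the pieces of this URL to the groups listed earlier (scheme = $2, authority = $4, path = $5), you can pass a third argument to preg_match and inspect the captures. This is only a small sketch; the variable names are mine:

```php
<?php
$url = "http://www.domain.nl/http://www.google.nl/modules/authorize/test/normal.php";

// Same RFC 2396 pattern as above, with the slashes escaped for the / delimiter.
$pattern = '/^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/';

preg_match($pattern, $url, $m);

// Trailing groups that did not take part in the match are left out of $m,
// so fall back to an empty string when reading them.
$scheme    = $m[2] ?? '';
$authority = $m[4] ?? '';
$path      = $m[5] ?? '';

echo $scheme, "\n";    // http
echo $authority, "\n"; // www.domain.nl
echo $path, "\n";      // /http://www.google.nl/modules/authorize/test/normal.php
```

The second "http://" simply ends up inside the path component, which is why nothing in the pattern objects to it.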
$url = "http://www.domain.nl/http://www.google.nl/modules/authorize/test/normal.php";
$result_rfc = preg_match('/^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/',$url);
if (substr_count($url,'://') != 1) {
    $result_non_rfc = false;
} else {
    $result_non_rfc = $result_rfc;
}
You can also try and adjust the regular expression itself.
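As one possible way of adjusting the expression, here is a sketch that forbids ":" inside the path group and anchors the pattern at the end of the string. This is my own variation, not something RFC 2396 or PHP prescribes, and the function name is_strict_url is made up for the example:

```php
<?php
// Hypothetical helper: a stricter variant of the RFC 2396 pattern.
// Changes compared to the original expression:
//   - the path group is [^?#:]* instead of [^?#]*, so ":" may not appear in the path
//   - \z anchors the match at the end, so leftover characters make it fail
function is_strict_url(string $url): bool
{
    $pattern = '/^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#:]*)(\?([^#]*))?(#(.*))?\z/';
    return preg_match($pattern, $url) === 1;
}

var_dump(is_strict_url("http://www.domain.nl/modules/authorize/test/normal.php"));
// bool(true)
var_dump(is_strict_url("http://www.domain.nl/http://www.google.nl/modules/authorize/test/normal.php"));
// bool(false)
var_dump(is_strict_url("http://www.domain.nl:8080/modules/authorize/test/normal.php"));
// bool(true), the port is part of the authority, not the path
```

Keep in mind that this also rejects paths that legitimately contain a colon, and that it is still only a shape check, so you would probably want to run it alongside filter_var rather than instead of it.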