Validating a Facebook page: its URL structure and whether it really exists

I have spent a couple of hours on how to validate Facebook pages. I found and read a lot of articles and posts but did not find anything that matches my requirement. I want to convert a user-input URL ($rawurl) into the format that I want ($goodurl). From googling, it seems a regex is the way to do it, but regexes are complicated and difficult for me to understand, so I need help.

The user can enter the URL in any form they like, for example:

http://facebook.com/WillSmith
https://facebook.com/WillSmith
http://www.facebook.com/WillSmith
https://www.facebook.com/WillSmith
www.facebook.com/WillSmith
facebook.com/WillSmith

Or any other way. Besides the vanity URL format, Facebook pages also come in other formats such as facebook.com/pages/usernames/somenumbers, and subdomains such as en-gb.facebook.com make things more difficult. After more googling, I found the regex http[s]?://(www|[a-zA-Z]{2}-[a-zA-Z]{2})\.facebook\.com/(pages/[a-zA-Z0-9\.-]+/[0-9]+|[a-zA-Z0-9\.-]+)[/]?$ but I am not sure whether it takes care of all the above conditions.

What I need help with:

1. The standard format I need is https://www.facebook.com/WillSmith.
2. I also need to check whether the URL is valid. For example, the URL above is valid, whereas https://www.facebook.com/WillSmith555 fits the format but there is no such page on Facebook. It says "Sorry, this page isn't available. The link you followed may be broken, or the page may have been removed" with a broken thumbs-up picture.

After checking these two criteria, I need the PHP file to echo whether the URL entered by the user is valid or invalid, after the conversion has been done.

Please help.

Solution

You can send a HEAD-only request to Facebook:

<?php

    function header_req( $url )
    {
        $channel = curl_init();
        curl_setopt($channel, CURLOPT_URL, $url);
        curl_setopt($channel, CURLOPT_CONNECTTIMEOUT, 10);
        curl_setopt($channel, CURLOPT_TIMEOUT, 10);
        curl_setopt($channel, CURLOPT_HEADER, true);
        curl_setopt($channel, CURLOPT_NOBODY, true);
        curl_setopt($channel, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($channel, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201');
        curl_setopt($channel, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($channel, CURLOPT_IPRESOLVE, CURL_IPRESOLVE_V4);
        // Keep certificate verification on so the response really comes from Facebook.
        curl_setopt($channel, CURLOPT_SSL_VERIFYPEER, true);
        curl_setopt($channel, CURLOPT_SSL_VERIFYHOST, 2);
        curl_exec($channel);
        $httpCode = curl_getinfo( $channel, CURLINFO_HTTP_CODE );
        curl_close($channel);
        return $httpCode;
    }

    $url = "https://www.facebook.com/WillSmith";

    // let's check that the URL's host is Facebook:


    // 1 add http if not found in URL
    if ( stripos( $url , "http") !== 0)
        $url = "http://" . $url;


    // 2 get facebook.com from URL
    $host = parse_url( $url, PHP_URL_HOST );

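    // 3 if the host is facebook.com or one of its subdomains then continue
    //   (a bare stripos() test fails here: it returns 0, which is falsy, for "facebook.com")
    $host = strtolower( (string) $host );
    if ( $host === "facebook.com" || substr( $host, -13 ) === ".facebook.com" )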
    {
        $response = header_req($url);

        if ( $response === 200 || $response === 302 )
            echo "Page Found";
        else
            echo "Page Not Found";
    }

?>

Advantages of this approach:

It fetches only the headers of the page, which will be around 1 KB to 5 KB.
No regexp is needed.
Every page is verified, whatever its URL pattern is :)
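
The question also asks for the URL to be rewritten as https://www.facebook.com/WillSmith, which the check above does not do. Here is a minimal sketch of that step, still without a regexp. The function name normalize_facebook_url is hypothetical, and the sketch assumes that only the path matters, so any query string or fragment is dropped. That assumption would not hold for URLs that identify the page through a query parameter.

```php
<?php
// Hypothetical helper (not part of the code above): rewrite a user-supplied
// Facebook URL to the form https://www.facebook.com/<path>, or return null.
function normalize_facebook_url(string $rawurl): ?string
{
    $rawurl = trim($rawurl);

    // parse_url() only recognises the host when a scheme is present.
    if (stripos($rawurl, 'http://') !== 0 && stripos($rawurl, 'https://') !== 0) {
        $rawurl = 'https://' . $rawurl;
    }

    $parts = parse_url($rawurl);
    if ($parts === false || empty($parts['host']) || empty($parts['path'])) {
        return null;
    }

    // Accept facebook.com itself or any subdomain (www, en-gb, ...).
    $host = strtolower($parts['host']);
    if ($host !== 'facebook.com' && substr($host, -13) !== '.facebook.com') {
        return null;
    }

    // Drop surrounding slashes; an empty path means no page was named.
    $path = trim($parts['path'], '/');
    if ($path === '') {
        return null;
    }

    // Assumption: query string and fragment are discarded.
    return 'https://www.facebook.com/' . $path;
}

// Usage: normalise first, then pass $goodurl to header_req() from the answer.
foreach (['facebook.com/WillSmith', 'http://en-gb.facebook.com/WillSmith/', 'example.com/WillSmith'] as $rawurl) {
    $goodurl = normalize_facebook_url($rawurl);
    echo $rawurl, ' => ', $goodurl ?? 'invalid', PHP_EOL;
}
```

A null result means the input is not a Facebook URL at all, so the HEAD request can be skipped and "invalid" echoed straight away.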