PHP get_headers() fails with Pinterest


I'm currently working on a tool that integrates links from different social networks:

Facebook: https://www.facebook.com/jonathan.parentlevesque

Google plus: https://plus.google.com/+JonathanParentL%C3%A9vesque

Instagram: https://instagram.com/mariloubiz/

Pinterest: https://www.pinterest.com/jonathan_parl/

RSS: https://regex101.com

Twitter: https://twitter.com/arcadefire

Vimeo: https://vimeo.com/ondemand/crashtest/135301838

Youtube: https://www.youtube.com/user/Darkjo666

I'm using very basic regex like this one:

/^https?:\/\/(?:[a-z]{2}|[w]{3})?\.pinterest.com\/[\S]{5,}$/i

on both the client and the server side for minimal domain validation of each link.
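As a side note, the pattern above leaves the dot in "pinterest.com" unescaped and makes the dot before it mandatory, so it accepts hosts such as "www.pinterestXcom" and rejects the bare "pinterest.com". A slightly stricter variant might look like the sketch below; the function name isPinterestProfileUrl is hypothetical and not part of the original tool.

```php
<?php
// Hypothetical helper: not part of the original tool.
function isPinterestProfileUrl(string $url): bool
{
    // Optional subdomain ("www" or a two-letter code) followed by a dot,
    // then the literal host "pinterest.com" and a path of at least 5 characters.
    $pattern = '/^https?:\/\/(?:(?:[a-z]{2}|www)\.)?pinterest\.com\/\S{5,}$/i';

    return preg_match($pattern, $url) === 1;
}

var_dump(isPinterestProfileUrl('https://www.pinterest.com/jonathan_parl/')); // bool(true)
var_dump(isPinterestProfileUrl('https://www.pinterestXcom/jonathan_parl/')); // bool(false)
```

This only checks the shape of the URL; whether the page actually exists still has to be verified with a request.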

Then, I'm using this function to validate that the page really exists (it's useless to integrate social network links that don't work after all):

public static function isUrlExists($url){

    $exists = false;

    if(!StringManager::stringStartWith($url, "http") and !StringManager::stringStartWith($url, "ftp")){

        $url = "https://" . $url;
    }

    if (preg_match(RegularExpression::URL, $url)){

        $headers = get_headers($url);

        if ($headers !== false and !empty($headers)){

            if (strpos($headers[0], '404') === false){

                $exists = true;
            }   
        }
    }

    return $exists;
}

Note: In this function I'm using Diego Perini's regex for validating the URL before sending the request:

const URL = "%^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@|\d{1,3}(?:\.\d{1,3}){3}|(?:(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)(?:\.(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)*(?:\.[a-z\x{00a1}-\x{ffff}]{2,6}))(?::\d+)?(?:[^\s]*)?$%iu"; //@copyright Diego Perini

None of the links tested so far generated any error, but testing Pinterest produces this rather scary series of error messages:

get_headers(): SSL operation failed with code 1. OpenSSL Error messages: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

Array
(
    [url] => https://www.pinterest.com/jonathan_parl/
    [exists] => 
)

get_headers(): Failed to enable crypto

Array
(
    [url] => https://www.pinterest.com/jonathan_parl/
    [exists] => 
)

get_headers(https://www.pinterest.com/jonathan_parl/): failed to open stream: operation failed

Array
(
    [url] => https://www.pinterest.com/jonathan_parl/
    [exists] => 
)

Does anyone have an idea what I'm doing wrong here?

I mean, isn't Pinterest a popular social network with a valid certificate (I don't use it personally, I just created an account for testing)?

Thank you for your help,

Jonathan Parent-Lévesque from Montreal


Solution

  • I tried to create a self-signed certificate for my development environment (XAMPP) as suggested by N.B. in his comment. That solution didn't work for me.

    His other suggestion was to use cURL or Guzzle instead of get_headers(). Not only did it work, but, according to this developer's tests:

    http://php.net/manual/fr/function.get-headers.php#104723

    it is also way faster than get_headers().

    For those interested, here's the code of my new function:

    /**
    * Send an HTTP request to the $url and check the header sent back.
    *
    * @param $url String url to which we must send the request.
    * @param $failCodeList Int array list of codes for which the page is considered invalid.
    *
    * @return Boolean
    */
    public static function isUrlExists($url, array $failCodeList = array(404)){
    
        $exists = false;
    
        if(!StringManager::stringStartWith($url, "http") and !StringManager::stringStartWith($url, "ftp")){
    
            $url = "https://" . $url;
        }
    
        if (preg_match(RegularExpression::URL, $url)){
    
            $handle = curl_init($url);
    
    
            curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    
            curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false);
    
            curl_setopt($handle, CURLOPT_HEADER, true);
    
            curl_setopt($handle, CURLOPT_NOBODY, true);
    
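            // CURLOPT_USERAGENT expects a string; the value below is only a placeholder.
            curl_setopt($handle, CURLOPT_USERAGENT, "Mozilla/5.0 (compatible; UrlChecker/1.0)");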
    
    
            $headers = curl_exec($handle);
    
            curl_close($handle);
    
    
            if (empty($failCodeList) or !is_array($failCodeList)){
    
                $failCodeList = array(404); 
            }
    
            if (!empty($headers)){
    
                $exists = true;
    
                $headers = explode(PHP_EOL, $headers);
    
                foreach($failCodeList as $code){
    
                    if (is_numeric($code) and strpos($headers[0], strval($code)) !== false){
    
                        $exists = false;
    
                        break;  
                    }
                }
            }
        }
    
        return $exists;
    }
    

    Let me explain the cURL options:

    CURLOPT_RETURNTRANSFER: return a string instead of displaying the calling page on the screen.

    CURLOPT_SSL_VERIFYPEER: when set to false, cURL doesn't verify the server's certificate

    CURLOPT_HEADER: include the header in the string

    CURLOPT_NOBODY: don't include the body in the string

    CURLOPT_USERAGENT: some sites need a user agent to respond properly (for example: https://plus.google.com)
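
    A possible variation, offered as a sketch rather than as what the function above does: if the "certificate verify failed" error comes from the local PHP installation having no CA bundle configured (an assumption, though a common situation on development stacks), verification can stay enabled by pointing cURL at a bundle file through CURLOPT_CAINFO. The helper name buildHeadRequestOptions, the $caBundlePath parameter, the user-agent string and the file path are all hypothetical.

```php
<?php
// Hypothetical helper: builds the option array for a header-only request.
// $caBundlePath is an assumed parameter, not something from the original function.
function buildHeadRequestOptions(?string $caBundlePath = null): array
{
    $options = [
        CURLOPT_RETURNTRANSFER => true, // return the response as a string
        CURLOPT_HEADER         => true, // include the headers in that string
        CURLOPT_NOBODY         => true, // header-only request, skip the body
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; UrlChecker/1.0)', // placeholder
        CURLOPT_SSL_VERIFYPEER => true, // keep certificate verification on
    ];

    if ($caBundlePath !== null) {
        // Point cURL at a CA bundle file (for example a downloaded cacert.pem).
        $options[CURLOPT_CAINFO] = $caBundlePath;
    }

    return $options;
}

// Usage sketch (performs a real network request when uncommented):
// $handle = curl_init('https://www.pinterest.com/');
// curl_setopt_array($handle, buildHeadRequestOptions('/path/to/cacert.pem'));
// $headers = curl_exec($handle);
// curl_close($handle);
```

    Keeping CURLOPT_SSL_VERIFYPEER enabled means the request still fails on a genuinely invalid certificate, which disabling the check would hide.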


    Additional note: I explode the header string and use $headers[0] so that only the status line (return code and message, e.g. 200, 404, 405) is checked.

    Additional note 2: Sometimes checking only for code 404 is not enough (see the unit test), so there's an optional $failCodeList parameter to supply the full list of codes to reject.
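
    The two notes above can also be sketched as a pair of small functions that parse the status code as a number instead of searching the status line for a substring. The names parseStatusCode and isAcceptedStatus are hypothetical, and the sample header string is made up for illustration.

```php
<?php
// Hypothetical helpers: the names are illustrative, not from the original class.

// Extract the numeric status code from the first line of a raw header string.
function parseStatusCode(string $rawHeaders): ?int
{
    // Split on CRLF or LF so the result does not depend on PHP_EOL.
    $lines = preg_split('/\r\n|\n/', $rawHeaders);

    if (preg_match('/^HTTP\/\d+(?:\.\d+)?\s+(\d{3})\b/', $lines[0], $matches) === 1) {
        return (int) $matches[1];
    }

    return null; // no recognizable status line
}

// A URL is accepted when a status code was found and it is not in the fail list.
function isAcceptedStatus(string $rawHeaders, array $failCodeList = [404]): bool
{
    $code = parseStatusCode($rawHeaders);

    return $code !== null && !in_array($code, array_map('intval', $failCodeList), true);
}

$sample = "HTTP/1.1 405 Method Not Allowed\r\nContent-Length: 0\r\n\r\n";
var_dump(parseStatusCode($sample));              // int(405)
var_dump(isAcceptedStatus($sample));             // bool(true)
var_dump(isAcceptedStatus($sample, [404, 405])); // bool(false)
```

    cURL can also report the code directly through curl_getinfo($handle, CURLINFO_HTTP_CODE), as long as it is called before curl_close().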

    And, of course, here's the unit test to back up my code:

    public function testIsUrlExists(){

        //invalid
        $this->assertFalse(ToolManager::isUrlExists("woot"));

        $this->assertFalse(ToolManager::isUrlExists("https://www.facebook.com/jonathan.parentlevesque4545646456"));

        $this->assertFalse(ToolManager::isUrlExists("https://plus.google.com/+JonathanParentL%C3%A9vesque890800"));

        $this->assertFalse(ToolManager::isUrlExists("https://instagram.com/mariloubiz1232132/", array(404, 405)));

        $this->assertFalse(ToolManager::isUrlExists("https://www.pinterest.com/jonathan_parl1231/"));

        $this->assertFalse(ToolManager::isUrlExists("https://regex101.com/546465465456"));

        $this->assertFalse(ToolManager::isUrlExists("https://twitter.com/arcadefire4566546"));

        $this->assertFalse(ToolManager::isUrlExists("https://vimeo.com/**($%?%$", array(400, 405)));

        $this->assertFalse(ToolManager::isUrlExists("https://www.youtube.com/user/Darkjo666456456456"));


        //valid
        $this->assertTrue(ToolManager::isUrlExists("www.google.ca"));

        $this->assertTrue(ToolManager::isUrlExists("https://www.facebook.com/jonathan.parentlevesque"));

        $this->assertTrue(ToolManager::isUrlExists("https://plus.google.com/+JonathanParentL%C3%A9vesque"));

        $this->assertTrue(ToolManager::isUrlExists("https://instagram.com/mariloubiz/"));

        $this->assertTrue(ToolManager::isUrlExists("https://www.facebook.com/jonathan.parentlevesque"));

        $this->assertTrue(ToolManager::isUrlExists("https://www.pinterest.com/"));

        $this->assertTrue(ToolManager::isUrlExists("https://regex101.com"));

        $this->assertTrue(ToolManager::isUrlExists("https://twitter.com/arcadefire"));

        $this->assertTrue(ToolManager::isUrlExists("https://vimeo.com/"));

        $this->assertTrue(ToolManager::isUrlExists("https://www.youtube.com/user/Darkjo666"));
    }
    

    I hope this solution will help someone,

    Jonathan Parent-Lévesque from Montreal