PHP - cURL should I set 'AUTOREFERER' when following redirects?


TL;DR

Why should or shouldn't I set CURLOPT_AUTOREFERER => true in my cURL function (that follows a limited number of redirects)?

Long(er) Version

I have a pretty standard cURL function that returns the headers for a given URL, following up to 10 redirects...

const SINGLETIMEOUT = 8;  // Seconds (is this too long?)

public static function getHeaders($url, $userAgent) {
    // Initialize cURL object
    $curl = curl_init($url);

    // Set options
    curl_setopt_array($curl, array(
        CURLOPT_USERAGENT => $userAgent,

        CURLOPT_HEADER => true,
        CURLOPT_NOBODY => true,

        CURLOPT_RETURNTRANSFER => true,

        CURLOPT_FOLLOWLOCATION => true, 
        CURLOPT_MAXREDIRS => 10, 
        CURLOPT_AUTOREFERER => true, 

        CURLOPT_TIMEOUT => SINGLETIMEOUT,   // Whole-transfer limit, in seconds (safety!)
        CURLOPT_CONNECTTIMEOUT => SINGLETIMEOUT
    ));

    // Run it
    curl_exec($curl);

    // Get headers
    $headers = curl_getinfo($curl);

    // Close it
    curl_close($curl);

    return $headers;
}

The function getHeaders works great, exactly as expected. But so far in my testing, there is no difference in performance or results, whether I include CURLOPT_AUTOREFERER => true or not. There are plenty of references saying what CURLOPT_AUTOREFERER does, but beyond that I can't find anything going into more depth on that particular option.

OK, so setting CURLOPT_AUTOREFERER => true will

... automatically set the Referer: header field in HTTP requests where it follows a Location: redirect

So what? Why does this matter? Should I keep it in or toss it? Will it cause the results to be different for some URLs? Will some domains return erroneous headers, the same as when I send an empty user agent?

And on, and on...

Most of the examples I found to make this function did not include it - but they also didn't include many of the other options that I'm including.


Solution

  • OK, some basic information first. According to Wikipedia:

    The HTTP referer (originally a misspelling of referrer) is an HTTP header field that identifies the address of the webpage (i.e. the URI or IRI) that linked to the resource being requested. By checking the referrer, the new webpage can see where the request originated. In the most common situation this means that when a user clicks a hyperlink in a web browser, the browser sends a request to the server holding the destination webpage. The request includes the referer field, which indicates the last page the user was on (the one where they clicked the link). Referer logging is used to allow websites and web servers to identify where people are visiting them from, for promotional or statistical purposes.

    However, here's an important detail: this header is supplied by the client, and the client can choose whether or not to send it. If the client does send it, it can supply any value it wants.

    Because the header is so easy to spoof, developers have learned not to rely on the referrer value for anything other than statistics. (You can even set the Referer header yourself in the cURL call with CURLOPT_REFERER instead of using CURLOPT_AUTOREFERER.)

    So it's generally inconsequential whether you supply it when using crawlers or cURL. It's up to you whether you want to let the remote site know where you came from; the request should work either way.

    That said, it's not impossible for a site to return different results based on the referrer. For example, I have seen a site that checked whether the referrer was Google in order to show additional in-site search results. But this is the exception rather than the rule, and apart from cases like that, sites should remain usable either way.
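If you want to check whether the referrer matters for a particular URL, one option is to run the same request with different Referer strategies and compare what comes back. The following is only a minimal sketch, assuming the cURL extension is loaded. The helper names buildHeaderOptions and fetchInfo are made up for illustration and are not from the question; the timeout of 8 seconds simply mirrors SINGLETIMEOUT above.

```php
<?php
// Hypothetical helper: builds the option array for a headers-only request.
// $referer     - explicit Referer value, or null to send none
// $autoReferer - whether to enable CURLOPT_AUTOREFERER
function buildHeaderOptions(string $userAgent, ?string $referer = null, bool $autoReferer = false): array
{
    $options = [
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_HEADER         => true,
        CURLOPT_NOBODY         => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 10,
        CURLOPT_TIMEOUT        => 8,
        CURLOPT_CONNECTTIMEOUT => 8,
    ];

    if ($referer !== null) {
        // Explicit value for the Referer header, chosen by the caller.
        $options[CURLOPT_REFERER] = $referer;
    }

    if ($autoReferer) {
        // Ask libcurl to set Referer automatically when it follows
        // a Location: redirect.
        $options[CURLOPT_AUTOREFERER] = true;
    }

    return $options;
}

// Hypothetical helper: runs the request and returns curl_getinfo() data.
function fetchInfo(string $url, array $options): array
{
    $curl = curl_init($url);
    curl_setopt_array($curl, $options);
    curl_exec($curl);

    // The handle is released when $curl goes out of scope.
    return curl_getinfo($curl);
}
```

To compare, call fetchInfo once with buildHeaderOptions($userAgent), once with an explicit referrer, and once with $autoReferer set to true, then look at the http_code, url and redirect_count entries of each result. If they match across all three runs, that URL most likely does not care about the referrer.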