PHP - for get_headers($url, 1), are the keys for status codes *always* integers?


Looking at the PHP docs for get_headers()...

array get_headers ( string $url [, int $format = 0 ] )

... there are two ways to run it:

#1 (format === 0)

$headers = get_headers($url);

// or

$headers = get_headers($url, 0);

#2 (format !== 0)

$headers = get_headers($url, 1);

The difference between the two being whether the arrays are numerically indexed (first case)...

(excerpt from docs)

Array
(
    [0] => HTTP/1.1 200 OK
    [1] => Date: Sat, 29 May 2004 12:28:13 GMT
    [2] => Server: Apache/1.3.27 (Unix)  (Red-Hat/Linux)
    ... etc

... or indexed with keys (second case)...

(excerpt from docs)

Array
(
    [0] => HTTP/1.1 200 OK
    [Date] => Sat, 29 May 2004 12:28:14 GMT
    [Server] => Apache/1.3.27 (Unix)  (Red-Hat/Linux)
    [Last-Modified] => Wed, 08 Jan 2003 23:11:55 GMT
    ... etc

In the example given in the docs, the http status code belongs to a numerical index...

[0] => HTTP/1.1 200 OK

... regardless of what format is set to.

Similarly, for every valid URL that I have ever put through get_headers() (i.e. many URLs), the status codes have always been under numerical indexes, even when multiple status codes are present...

// Output from JSON.stringify(get_headers($url, 1))

{
    "0": "HTTP/1.1 301 Moved Permanently",
    "1": "HTTP/1.1 200 OK",
    "Date": [
        "Thu, 11 Aug 2016 07:12:28 GMT",
        "Thu, 11 Aug 2016 07:12:28 GMT"
    ],
    "Content-Type": [
        "text/html; charset=iso-8859-1",
        "text/html; charset=UTF-8"
    ]
    ... etc

But I have not tested (read: cannot test) every URL on every type of server, and so cannot speak in absolutes about the status code indexes.

Is it possible that get_headers($url, 1) could return a non-numerical http status code index? Or is it hard-coded into the function to always return the status codes under numerical indices - no matter what?


Extra reading, not necessary or essential to the question above...

For the curious, my question is mostly to do with optimization. get_headers() is already painfully slow - even when sending a HEAD request instead of GET - and only gets worse after combing through the return array with a preg_match and regex.

(The various cURL methods you'll find are even slower; I've tested them against get_headers() with very long lists of URLs, so holster that hip-shot, partner.)

If I know that the status codes are always numerically indexed, then I can speed my code up a bit, by ignoring all non-integer indices, before running them through the preg_match. The difference for one URL might only be fractions of a second, but when running this function all day, every day, those little bits add up.
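The idea above could be sketched roughly as follows. This is only an illustration: status_codes_from_headers() is a hypothetical helper name, the sample array is hand-written to mirror the output shown earlier, and it relies on the assumption that status lines always sit under integer keys.

```php
<?php
// Hypothetical helper (not part of PHP): collect status codes from an array
// shaped like the result of get_headers($url, 1), looking only at integer keys.
function status_codes_from_headers(array $headers): array
{
    $codes = [];
    foreach ($headers as $key => $value) {
        // Named headers have string keys; skip them before any regex work.
        if (!is_int($key) || !is_string($value)) {
            continue;
        }
        // A status line looks like "HTTP/1.1 301 Moved Permanently".
        if (preg_match('~^HTTP/\d+(?:\.\d+)?\s+(\d{3})~', $value, $m)) {
            $codes[] = (int) $m[1];
        }
    }
    return $codes;
}

// Hand-written sample data, so no network request is needed.
$sample = [
    0 => 'HTTP/1.1 301 Moved Permanently',
    1 => 'HTTP/1.1 200 OK',
    'Date' => ['Thu, 11 Aug 2016 07:12:28 GMT', 'Thu, 11 Aug 2016 07:12:28 GMT'],
    'Content-Type' => ['text/html; charset=iso-8859-1', 'text/html; charset=UTF-8'],
];

print_r(status_codes_from_headers($sample));
```

Whether the is_int() check actually saves measurable time compared with running preg_match on every value is something to benchmark rather than assume.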

Additionally (Edit #1)

I'm currently only worried about the final http status code (and URL), after all redirects. I was using a method similar to this to get the final URL.

It seems that after running

$headers = array_reverse($headers);

then the final status code after the redirects will always be in $headers[0]. But, once again, this is only a sure thing if the status codes are numerically indexed.
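A small sketch of why the array_reverse() trick works, under the same assumption that status lines have integer keys: array_reverse() without its second argument renumbers integer keys from 0 in the reversed order and leaves string keys alone. The helper name final_status_line() and the sample data are made up for illustration.

```php
<?php
// Hypothetical helper: return the last status line from an array shaped
// like the result of get_headers($url, 1), or null if there is none.
function final_status_line(array $headers): ?string
{
    // Integer keys are renumbered from 0 in reversed order; string keys are
    // preserved, so the last integer-keyed entry ends up under key 0.
    $reversed = array_reverse($headers);
    return (isset($reversed[0]) && is_string($reversed[0])) ? $reversed[0] : null;
}

// Hand-written sample: status lines interleaved with named headers.
$sample = [
    0 => 'HTTP/1.1 301 Moved Permanently',
    'Location' => 'https://example.com/',
    1 => 'HTTP/1.1 200 OK',
    'Date' => 'Thu, 11 Aug 2016 07:12:28 GMT',
];

echo final_status_line($sample), PHP_EOL;
```

Note that any other header line stored under an integer key after the final status line would take index 0 instead, so validating the returned string against a status-line pattern may still be worthwhile.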


Solution

  • The PHP C source code for that function looks like this:

            if (!format) {
    no_name_header:
                add_next_index_str(return_value, zend_string_copy(Z_STR_P(hdr)));
            } else {
                char c;
                char *s, *p;
    
                if ((p = strchr(Z_STRVAL_P(hdr), ':'))) {
                    ... omitted ...
                } else {
                    goto no_name_header;
                }
            }
    

    In other words, it tests whether there's a : in the header line and, if so, indexes the value by the header's name (omitted here). If there's no :, or if you did not pass a non-zero $format, the no_name_header branch kicks in and appends the line to return_value without an explicit key, so it gets the next integer index.

    So, yes, the status lines should always be numerically indexed. Unless the server puts a : into the status line, which would be unusual. Note that RFC 2616 does not explicitly prohibit the use of : in the reason phrase part of the status line:

    Status-Line    = HTTP-Version SP Status-Code SP Reason-Phrase CRLF
    
    Reason-Phrase  = *<TEXT, excluding CR, LF>
    
    TEXT           = <any OCTET except CTLs,
                     but including LWS>
    

    There is no standardised reason phrase which contains a ":", but you never know, you may encounter exotic servers in the wild which defy convention here…
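To make the branch logic above concrete, here is a rough PHP emulation of it. It is not PHP's actual implementation: index_header_lines() is a hypothetical name, and the handling of the omitted branch (trimming the value, collecting repeated names into an array) is an assumption based on the sample output in the question.

```php
<?php
// Rough emulation of the quoted C branch; NOT PHP's real code.
// Assumption: the omitted part trims the value and gathers repeated
// header names into an array, as the question's sample output suggests.
function index_header_lines(array $lines, bool $format): array
{
    $result = [];
    foreach ($lines as $line) {
        $pos = $format ? strpos($line, ':') : false;
        if ($pos === false) {
            // The no_name_header path: append under the next integer key.
            $result[] = $line;
            continue;
        }
        $name  = substr($line, 0, $pos);
        $value = trim(substr($line, $pos + 1));
        if (!array_key_exists($name, $result)) {
            $result[$name] = $value;
        } else {
            // Repeated header name: collect the values in an array.
            $result[$name] = array_merge((array) $result[$name], [$value]);
        }
    }
    return $result;
}

// An ordinary status line keeps an integer key...
print_r(index_header_lines(['HTTP/1.1 200 OK', 'Server: demo'], true));

// ...but a made-up status line containing ":" ends up under a string key.
print_r(index_header_lines(['HTTP/1.1 200 OK: fine', 'Server: demo'], true));
```

In this emulation the exotic status line is split at its first colon, which is the edge case the answer warns about; code that skips string keys would miss it.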