python python-requests http-status-codes httplib2

HEAD and GET HTTP requests return different status codes for the same URL

I'm trying to check whether some URLs stored in my DB are still valid links. To do this, I am using httplib2 to send HEAD requests, so that I avoid downloading the entire content of each page. I was quite happy with the results.

But then I discovered some cases where the status code returned for a HEAD request differs from the one returned for a GET request.

So, just in case of a bug in the library, I made some tests with different libraries (below is my "requests" lib test):

> import requests    

> rg = requests.get("https://fr.news.yahoo.com/chemin-dames-l-hommage-personnel-pr%C3%A9sident-121005844.html")
> rh = requests.head("https://fr.news.yahoo.com/chemin-dames-l-hommage-personnel-pr%C3%A9sident-121005844.html")

> print("GET status code:", rg.status_code)
  ('GET status code:', 200)

> print("HEAD status code:", rh.status_code)
  ('HEAD status code:', 404)

But whichever library I use, I still get different GET and HEAD status codes for the same URL.

So, obviously the site maintainer decided not to return an identical status code for HEAD and GET requests... and that seems legitimate, even if not recommended.

Is there a way to avoid this problem and still know whether a link is valid without having to download the entire content of the almost 2 million URLs that I need to verify?

I can double-check with a GET request whenever a HEAD request returns a status code of 400 or above, but that seems like a dirty workaround to me.
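For reference, that fallback would look roughly like the sketch below. The helper name `check_url` and its `timeout` default are made up for illustration, and `session` is assumed to be a `requests.Session()` (or anything with compatible `head` and `get` methods):

```python
def check_url(url, session, timeout=10):
    """Return the HEAD status code, re-checked with GET when HEAD reports an error.

    `session` is assumed to be a requests.Session() or a compatible object.
    """
    # requests does not follow redirects on HEAD by default, so ask for it.
    response = session.head(url, allow_redirects=True, timeout=timeout)
    if response.status_code >= 400:
        # HEAD says the link is broken: confirm with a full GET request.
        response = session.get(url, allow_redirects=True, timeout=timeout)
    return response.status_code
```

It would be called as `check_url(url, requests.Session())`, reusing one session across URLs so that connections can be pooled.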

Solution

It seems that you might have to do it the GET way. While HEAD should return 200 when the page is live, there is simply no guarantee of that, and it is up to the server how to implement it. In addition, some would argue that a HEAD request should return 404. The specification simply says this about 404:

This status code is commonly used when the server does not wish to reveal exactly why the request has been refused

You should also take into account all the errors and mistakes that can be present in server implementations. A simple Google search will show you just how many bugs of this sort there are. It may well be that HEAD returns 200 but GET returns 404, so your suggested method of double-checking only the HEAD 404s with GET requests won't be 100% reliable either.
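If you go the GET way with requests, you do not necessarily have to download each page. A minimal sketch, assuming `session` is a `requests.Session()`; the helper name `get_status_without_body` and the `timeout` default are hypothetical:

```python
from contextlib import closing


def get_status_without_body(url, session, timeout=10):
    """Send a GET request and return its status code without reading the body.

    `session` is assumed to be a requests.Session() or a compatible object.
    """
    # With stream=True, requests returns once the headers have arrived;
    # the body is only fetched if .content, .text or iter_content() is used.
    response = session.get(url, stream=True, timeout=timeout)
    # Closing the response releases the connection with the body still unread.
    with closing(response):
        return response.status_code
```

The server still receives a real GET, so it answers with the status it would give a browser. It may already have started sending the body before the connection is closed, so this reduces the amount transferred rather than guaranteeing that nothing beyond the headers is sent.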