Tags: python, http-headers, etag, web-crawler, if-modified-since

Python: Optimal algorithm to avoid downloading unchanged pages while crawling


I am writing a crawler that regularly inspects a list of news websites for new articles. I have read about different approaches for avoiding unnecessary page downloads and have identified five header elements that could be useful for determining whether a page has changed (a sketch of reading these headers follows the list):

  1. HTTP status
  2. ETag
  3. Last-Modified (to combine with an If-Modified-Since request header)
  4. Expires
  5. Content-Length
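
All of these header fields can be read without downloading the body by issuing a HEAD request. A minimal sketch using the standard library's urllib.request (fetch_headers is just an illustrative name):

import urllib.request

def fetch_headers(url):
    # HEAD transfers only the status line and headers, not the body.
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return {
            "status": resp.status,
            "etag": resp.headers.get("ETag"),
            "last_modified": resp.headers.get("Last-Modified"),
            "expires": resp.headers.get("Expires"),
            "content_length": resp.headers.get("Content-Length"),
        }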

The excellent FeedParser.org seems to implement some of these approaches.

I am looking for optimal code in Python (or any similar language) that makes this kind of decision. Assume that the header information is always provided by the server.

That could be something like:

def shouldDownload(url, prev_etag, prev_lastmod, prev_expires, prev_content_length):
    # retrieve the headers, do the magic here and return the decision
    return decision

Solution

  • The only thing you need to check before making the request is Expires. If-Modified-Since is not something the server sends you, but something you send the server.

    What you want to do is an HTTP GET with an If-Modified-Since header indicating when you last retrieved the resource. If you get back status code 304 rather than the usual 200, the resource has not been modified since then, and you should use your stored copy (a new copy will not be sent).
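
    As a minimal sketch, assuming prev_lastmod holds the date string saved from the previous response, the conditional GET could look like this with the standard library's urllib.request (note that urlopen signals a 304 by raising HTTPError rather than returning a response):

    import urllib.request
    import urllib.error

    def fetch_if_modified(url, prev_lastmod):
        req = urllib.request.Request(url, headers={"If-Modified-Since": prev_lastmod})
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.read()  # 200: the server sent a fresh copy
        except urllib.error.HTTPError as err:
            if err.code == 304:
                return None         # 304: unchanged, reuse the stored copy
            raise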

    Additionally, you should retain the Expires header from the last time you retrieved the document, and not issue the GET at all if your stored copy of the document has not expired.
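
    For that Expires check, a sketch using only the standard library (is_expired is an illustrative name; HTTP dates parse with email.utils):

    from datetime import datetime, timezone
    from email.utils import parsedate_to_datetime

    def is_expired(prev_expires):
        # prev_expires is the Expires value stored from the last response.
        if not prev_expires:
            return True  # no Expires header: revalidate to be safe
        try:
            expires_at = parsedate_to_datetime(prev_expires)
        except (TypeError, ValueError):
            return True  # unparseable date (e.g. "0"): revalidate to be safe
        return datetime.now(timezone.utc) >= expires_at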

    Translating this into Python is left as an exercise, but it should be straightforward to add an If-Modified-Since header to a request, to store the Expires header from the response, and to check the status code from the response.
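
    Putting the pieces together, one possible shape for the shouldDownload function from the question is sketched below. It assumes the is_expired helper above, and it also sends If-None-Match since the question stores an ETag, although the answer itself only relies on Expires and If-Modified-Since:

    import urllib.request
    import urllib.error

    def shouldDownload(url, prev_etag, prev_lastmod, prev_expires, prev_content_length):
        # If the stored copy has not expired yet, skip the request entirely.
        if prev_expires and not is_expired(prev_expires):
            return False

        # Otherwise issue a conditional GET; 304 means "use the stored copy".
        headers = {}
        if prev_etag:
            headers["If-None-Match"] = prev_etag
        if prev_lastmod:
            headers["If-Modified-Since"] = prev_lastmod
        req = urllib.request.Request(url, headers=headers)
        try:
            with urllib.request.urlopen(req) as resp:
                return True   # 200: the page changed
        except urllib.error.HTTPError as err:
            if err.code == 304:
                return False  # unchanged since the last fetch
            raise

    Note that on a 200 the new copy has already been transferred, so in practice you would save resp.read() inside the try block rather than issue a second download.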