I am writing a crawler which regularly inspects a list of news websites for new articles. I have read about different approaches for avoiding unnecessary page downloads, and have basically identified 5 header elements that could be useful for determining whether a page has changed or not.
The excellent FeedParser.org seems to implement some of these approaches.
I am looking for optimal code in Python (or any similar language) that makes this kind of decision. Keep in mind that the header info is always provided by the server.
That could be something like:
def shouldDownload(url, prev_etag, prev_lastmod, prev_expires, prev_content_length):
    # retrieve the headers, do the magic here and return the decision
    return decision
The only thing you need to check before making the request is Expires. If-Modified-Since is not something the server sends you, but something you send the server.
What you want to do is an HTTP GET with an If-Modified-Since header indicating when you last retrieved the resource. If you get back status code 304 rather than the usual 200, the resource has not been modified since then, and you should use your stored copy (a new copy will not be sent).
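For illustration, here is a minimal sketch of that conditional GET, assuming the third-party requests library (urllib.request would work as well); the helper name and the prev_lastmod argument, meaning the Last-Modified value saved from an earlier response, are hypothetical:

import requests

def fetch_if_modified(url, prev_lastmod):
    # Send the stored Last-Modified value back as If-Modified-Since and
    # let the server decide whether to resend the body.
    headers = {"If-Modified-Since": prev_lastmod} if prev_lastmod else {}
    response = requests.get(url, headers=headers)
    if response.status_code == 304:
        return None        # not modified: reuse the stored copy
    return response        # new content (or a first fetch)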
Additionally, you should retain the Expires header from the last time you retrieved the document, and not issue the GET at all if your stored copy of the document has not expired.
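That pre-request check could look roughly like this, assuming prev_expires holds the raw Expires header string saved from the last response; HTTP dates can be parsed with the standard library's email.utils helper, and the helper name here is hypothetical:

from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def is_still_fresh(prev_expires):
    # True means the stored copy has not expired, so no request is needed.
    if not prev_expires:
        return False
    try:
        expires_at = parsedate_to_datetime(prev_expires)
    except (TypeError, ValueError):
        return False       # unparseable date: fall back to asking the server
    return datetime.now(timezone.utc) < expires_at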
Translating this into Python is left as an exercise, but it should be straightforward to add an If-Modified-Since header to a request, to store the Expires header from the response, and to check the status code from the response.