python django python-requests httplib2

Checking if a URL exists and is smaller than x bytes without consuming full response


I have a use case where I want to check (from within a Python/Django project) whether the response to a GET request is smaller than x bytes, whether the whole response completes within y seconds, and whether the response status is 200. The URLs being tested are submitted by end users.

Some constraints:

  1. A HEAD request is not acceptable, because some servers might omit Content-Length, lie about it, or simply block HEAD requests.
  2. I would not like to consume the full GET response body. Imagine an end user submitting the URL of a 10 GB file: all of my server's bandwidth (and memory) would be consumed by it.

tl;dr: Is there any Python HTTP API that:

  1. Accepts a timeout for the whole transaction (I think httplib2 does this).
  2. Reports the response status, so I can check that it is 200 (all HTTP libraries do this).
  3. Kills the request (perhaps with RST) once x bytes have been received, to avoid bandwidth starvation.

Here x would probably be on the order of kilobytes, and y a few seconds.


Solution

  • You could open the URL with urllib (urllib.request.urlopen in Python 3) and read(x+1) from the returned object. If the returned data is x+1 bytes long, the resource is larger than x. Then call close() on the object to close the connection, i.e. kill the request. In the worst case, this will fill the OS's TCP receive buffer, which you cannot avoid anyway; usually, this should not fetch more than a few kB beyond x.

    If you also add a Range header (e.g. Range: bytes=0-x) to the request, servers that support range requests will send only the first x+1 bytes. Note that this changes the reply code to 206 Partial Content. A resource shorter than the requested range is still returned in full with 206; for a range starting at byte 0, 416 Range Not Satisfiable should only occur when the resource is empty. Servers that do not support range requests will ignore the header and reply with 200, so this should be a safe measure.
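
The approach above could be sketched roughly as follows, assuming Python 3 and only the standard library. The helper names read_capped and url_is_small_enough, the 4096-byte chunk size, and the http/https scheme check are illustrative assumptions, not part of the original answer. The deadline is only checked between reads, so a single stalled read can still overrun it by up to the socket timeout; treat this as a sketch rather than a hardened implementation.

```python
import http.client
import time
import urllib.parse
import urllib.request


def read_capped(stream, max_bytes, deadline=None, chunk_size=4096):
    """Read at most max_bytes + 1 bytes from a file-like object.

    Returns (data, too_large): data holds at most max_bytes bytes, and
    too_large is True if the body contained more than max_bytes bytes.
    """
    limit = max_bytes + 1  # one extra byte tells us the body is too big
    chunks = []
    received = 0
    while received < limit:
        # Assumption: deadline is a time.monotonic() value, checked between reads.
        if deadline is not None and time.monotonic() > deadline:
            raise TimeoutError("overall deadline exceeded")
        chunk = stream.read(min(chunk_size, limit - received))
        if not chunk:  # EOF: the body ended before reaching the limit
            break
        chunks.append(chunk)
        received += len(chunk)
    data = b"".join(chunks)
    return data[:max_bytes], received > max_bytes


def url_is_small_enough(url, max_bytes, timeout):
    """Return True if url answers 200/206 with at most max_bytes bytes in time."""
    # User-submitted URLs: urllib would also open file:// and similar schemes.
    if urllib.parse.urlsplit(url).scheme not in ("http", "https"):
        return False
    deadline = time.monotonic() + timeout
    # Ask for bytes 0..max_bytes inclusive, i.e. max_bytes + 1 bytes.
    request = urllib.request.Request(
        url, headers={"Range": "bytes=0-%d" % max_bytes}
    )
    try:
        # timeout here applies per socket operation, not to the whole transfer.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            if response.status not in (200, 206):
                return False
            _, too_large = read_capped(response, max_bytes, deadline)
            return not too_large
        # Leaving the with block closes the response without reading the rest.
    except (OSError, ValueError, http.client.HTTPException):
        # Covers URLError/HTTPError (including 416), timeouts, malformed URLs.
        return False
```

A call such as url_is_small_enough(submitted_url, 64 * 1024, 5) would then accept only responses of at most 64 kB that arrive within roughly five seconds. Note that urlopen follows redirects by default, and that an empty resource answered with 416 is reported as a failure in this sketch.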