python http urllib2 urllib urllib3

Python | HTTP - How to check file size before downloading it


I am crawling the web using urllib3. Example code:

from urllib3 import PoolManager

pool = PoolManager()
response = pool.request("GET", url)

The problem is that I may stumble upon a URL that points to a really large file, and I am not interested in downloading it.

I found this question - Link - and it suggests using urllib and urlopen. I don't want to contact the server twice.

I want to limit the file size to 25MB. Is there a way I can do this with urllib3?


Solution

  • If the server supplies a Content-Length header, you can use it to decide whether to download the rest of the body. If the server does not provide the header, you'll need to stream the response and stop once you have read more than you are willing to accept.

    Either way, you need to make sure the full response is not preloaded, which means passing preload_content=False to the request.

    from urllib3 import PoolManager
    
    pool = PoolManager()
    response = pool.request("GET", url, preload_content=False)
    
    # Maximum number of bytes we are willing to read (about 1 MB here;
    # use 25 * 1024 * 1024 for the 25 MB limit from the question)
    max_bytes = 1000000
    
    content_bytes = response.headers.get("Content-Length")
    if content_bytes and int(content_bytes) < max_bytes:
        # Expected body is smaller than our maximum, read the whole thing
        data = response.read()
        # Do something with data
        ...
    elif content_bytes is None:
        # Alternatively, stream until we hit our limit
        amount_read = 0
        for chunk in response.stream():
            amount_read += len(chunk)
            # Save chunk
            ...
            if amount_read > max_bytes:
                break
    
    # Release the connection back into the pool
    response.release_conn()
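
The two branches above can also be folded into one helper, so that a missing or understated Content-Length is still caught while streaming. The following is only a rough sketch, not part of the original answer: the names read_capped and fetch_capped are made up for illustration, the 64 KiB chunk size is an arbitrary choice, and the 25 MiB default comes from the limit mentioned in the question. It assumes pool is a urllib3 PoolManager (or anything with the same request() interface).

```python
from typing import Iterable, Optional


def read_capped(chunks: Iterable[bytes],
                content_length: Optional[str],
                max_bytes: int) -> Optional[bytes]:
    """Hypothetical helper: return the body, or None if it exceeds max_bytes."""
    # Use the header only to reject early; it may be absent or malformed.
    if content_length is not None:
        try:
            declared = int(content_length)
        except ValueError:
            declared = None  # malformed header: fall back to counting bytes
        if declared is not None and declared > max_bytes:
            return None

    # Count bytes as they arrive, so a missing or wrong header is still caught.
    body = bytearray()
    for chunk in chunks:
        body.extend(chunk)
        if len(body) > max_bytes:
            return None
    return bytes(body)


def fetch_capped(pool, url: str, max_bytes: int = 25 * 1024 * 1024) -> Optional[bytes]:
    """Hypothetical wrapper: a single GET, with the body streamed lazily."""
    response = pool.request("GET", url, preload_content=False)
    try:
        return read_capped(response.stream(64 * 1024),
                           response.headers.get("Content-Length"),
                           max_bytes)
    finally:
        # Release the connection back into the pool, as in the answer above
        response.release_conn()
```

With a real pool this would be called as fetch_capped(PoolManager(), url). A return value of None means the file was skipped for being too large. Only one request is sent to the server in either case.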