Get size of a file before downloading in Python


I'm downloading an entire directory from a web server. It works OK, but I can't figure out how to get each file's size before downloading it, so I can compare it with my local copy and tell whether the file was updated on the server. Can this be done the way it can when downloading a file from an FTP server?

import urllib
import re

url = "http://www.someurl.com"

# Download the page locally
f = urllib.urlopen(url)
html = f.read()
f.close()

f = open ("temp.htm", "w")
f.write (html)
f.close()

# List only the .TXT / .ZIP files
fnames = re.findall(r'^.*<a href="(\w+\.(?:txt|zip))".*$', html, re.MULTILINE)

for fname in fnames:
    print fname, "..."

    f = urllib.urlopen(url + "/" + fname)

    #### Here I want to check the filesize to download or not #### 
    file = f.read()
    f.close()

    f = open (fname, "w")
    f.write (file)
    f.close()

@Jon: thanks for your quick answer. It works, but the file size reported by the web server is slightly smaller than the size of the downloaded file.

Examples:

Local Size  Server Size
 2.223.533  2.115.516
   664.603    662.121

Does this have anything to do with CR/LF conversion?


Solution

  • I have reproduced what you are seeing:

    import urllib, os
    link = "http://python.org"
    print "opening url:", link
    site = urllib.urlopen(link)
    meta = site.info()
    print "Content-Length:", meta.getheaders("Content-Length")[0]
    
    f = open("out.txt", "r")
    print "File on disk:",len(f.read())
    f.close()
    
    
    f = open("out.txt", "w")
    f.write(site.read())
    site.close()
    f.close()
    
    f = open("out.txt", "r")
    print "File on disk after download:",len(f.read())
    f.close()
    
    print "os.stat().st_size returns:", os.stat("out.txt").st_size
    

    Outputs this:

    opening url: http://python.org
    Content-Length: 16535
    File on disk: 16535
    File on disk after download: 16535
    os.stat().st_size returns: 16861
    

    What am I doing wrong here? Is os.stat().st_size not returning the correct size?


    Edit: OK, I figured out what the problem was:

    import urllib, os
    link = "http://python.org"
    print "opening url:", link
    site = urllib.urlopen(link)
    meta = site.info()
    print "Content-Length:", meta.getheaders("Content-Length")[0]
    
    f = open("out.txt", "rb")
    print "File on disk:",len(f.read())
    f.close()
    
    
    f = open("out.txt", "wb")
    f.write(site.read())
    site.close()
    f.close()
    
    f = open("out.txt", "rb")
    print "File on disk after download:",len(f.read())
    f.close()
    
    print "os.stat().st_size returns:", os.stat("out.txt").st_size
    

    This outputs:

    $ python test.py
    opening url: http://python.org
    Content-Length: 16535
    File on disk: 16535
    File on disk after download: 16535
    os.stat().st_size returns: 16535
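
The gap in the first run (16861 - 16535 = 326 bytes) is what newline translation would produce: on Windows, a file opened in text mode has every "\n" expanded to "\r\n" on write, so the file on disk grows by one byte per line feed. The following is a minimal Python 3 sketch that reproduces the effect on any platform. It forces the Windows behaviour with `newline="\r\n"`; the helper name, file names, and sample payload are made up for illustration.

```python
import os
import tempfile


def written_sizes(payload):
    """Write payload in binary mode and in CRLF-translating text mode.

    Returns (binary_size, text_size) as measured on disk.
    """
    workdir = tempfile.mkdtemp()
    binary_path = os.path.join(workdir, "binary.out")
    text_path = os.path.join(workdir, "text.out")

    # Binary mode: the bytes are written exactly as received.
    with open(binary_path, "wb") as f:
        f.write(payload)

    # Text mode with newline="\r\n" mimics the Windows default:
    # every "\n" is expanded to "\r\n" on write.
    with open(text_path, "w", newline="\r\n", encoding="latin-1") as f:
        f.write(payload.decode("latin-1"))

    return os.path.getsize(binary_path), os.path.getsize(text_path)


payload = b"line one\nline two\nline three\n"  # hypothetical download body
binary_size, text_size = written_sizes(payload)
print("payload length:  ", len(payload))
print("binary mode size:", binary_size)
print("text mode size:  ", text_size)
```

The binary-mode size matches the payload length, while the text-mode size exceeds it by the number of "\n" bytes in the payload.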
    

    Make sure you are opening both files for binary read/write.

    # open for binary write
    open(filename, "wb")
    # open for binary read
    open(filename, "rb")
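
Putting the pieces together for the original question, here is a rough Python 3 sketch of checking the size before downloading. It assumes the server answers HEAD requests and sends a Content-Length header, which not every server does. The function names `remote_size`, `needs_download`, and `download` are hypothetical helpers, not part of `urllib`.

```python
import os
import urllib.request


def remote_size(url):
    """Return the Content-Length the server reports for url, or None if absent."""
    # A HEAD request asks for the headers only, so no body is transferred.
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        length = response.headers.get("Content-Length")
    return int(length) if length is not None else None


def needs_download(server_size, local_path):
    """Decide whether to fetch a file, given the size reported by the server."""
    if server_size is None or not os.path.exists(local_path):
        return True  # nothing to compare against, so download
    return os.path.getsize(local_path) != server_size


def download(url, local_path):
    """Fetch url into local_path in binary mode so the sizes stay comparable."""
    with urllib.request.urlopen(url) as response, open(local_path, "wb") as out:
        out.write(response.read())
```

Inside the download loop this could be used as `if needs_download(remote_size(file_url), fname): download(file_url, fname)`, where `file_url` stands for `url + "/" + fname`. Comparing sizes only catches changes that alter the length of the file; an edit that leaves the size unchanged would go unnoticed.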