
Why is urllib2 taking so long to read?


I'm writing a simple program to compare HTML pages, but my current bottleneck is reading the HTML files. Specifically, this code:

    import urllib2

    htmldata1 = urllib2.urlopen(url1).read()
    htmldata2 = urllib2.urlopen(url2).read()

The URLs are from IMDB. I don't know why it takes so long (about 9 seconds on average). It may be downloading the images, when I just want the HTML text to search with regular expressions. I have never used urllib2, so any help would be appreciated.

Edit:

An example URL I use is:

"http://www.imdb.com/title/tt0944947/fullcredits?ref_=tt_cl_sm#cast"


Solution

  • The page is just super-slow to load (on the server's end). This is on gigabit fiber:

    In [4]: url1 = "http://www.imdb.com/title/tt0944947/fullcredits?ref_=tt_cl_sm#cast"
    
    In [5]: %time result = urllib2.urlopen(url1).read()
    CPU times: user 56.3 ms, sys: 21.6 ms, total: 77.9 ms
    Wall time: 2.16 s
    
    In [7]: %time result2 = requests.get(url1)
    CPU times: user 29.5 ms, sys: 6.35 ms, total: 35.9 ms
    Wall time: 2.18 s
    

    And outside of Python entirely:

    $ time curl -o/dev/null "http://www.imdb.com/title/tt0944947/fullcredits?ref_=tt_cl_sm#cast"
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 2173k    0 2173k    0     0   537k      0 --:--:--  0:00:04 --:--:--  540k
    curl -o/dev/null   0.01s user 0.03s system 0% cpu 4.074 total
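
    You can reproduce the same measurement in plain Python, without IPython's %time magic. Below is a minimal sketch (assuming the same IMDB URL from the question) that times a single fetch. Note that urlopen().read() transfers only the HTML document itself, roughly 2 MB here, and never requests the images referenced by the page, so nearly all of the wall time is spent waiting on IMDB's server rather than executing Python code.

        import time
        import urllib2

        # The #cast fragment is never sent to the server, so it has no
        # effect on the download time.
        url1 = "http://www.imdb.com/title/tt0944947/fullcredits?ref_=tt_cl_sm#cast"

        start = time.time()
        htmldata1 = urllib2.urlopen(url1).read()  # fetches only the HTML document
        elapsed = time.time() - start

        # Roughly 2 MB of HTML; almost all of the elapsed time is network
        # wait on the server's end, not local processing.
        print "fetched %d bytes in %.2f seconds" % (len(htmldata1), elapsed)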