I'm writing a simple program to compare HTML pages, but my current bottleneck is downloading the HTML. Specifically, this code:
htmldata1 = urllib2.urlopen(url1).read()
htmldata2 = urllib2.urlopen(url2).read()
The URLs are from IMDB. I don't know why it takes so long (about 9 seconds on average). It may be downloading the images when I just want the HTML text to search with regular expressions. I have never used urllib2 before, so any help would be appreciated.
Edit:
An example url I use is
"http://www.imdb.com/title/tt0944947/fullcredits?ref_=tt_cl_sm#cast"
The page is just super-slow to load (on the server's end). This is on gigabit fiber:
In [4]: url1 = "http://www.imdb.com/title/tt0944947/fullcredits?ref_=tt_cl_sm#cast"
In [5]: %time result = urllib2.urlopen(url1).read()
CPU times: user 56.3 ms, sys: 21.6 ms, total: 77.9 ms
Wall time: 2.16 s
In [7]: %time result2 = requests.get(url1)
CPU times: user 29.5 ms, sys: 6.35 ms, total: 35.9 ms
Wall time: 2.18 s
And outside of Python entirely:
$ time curl -o/dev/null "http://www.imdb.com/title/tt0944947/fullcredits?ref_=tt_cl_sm#cast"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2173k    0 2173k    0     0   537k      0 --:--:--  0:00:04 --:--:--  540k
curl -o/dev/null 0.01s user 0.03s system 0% cpu 4.074 total
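If the goal is just to compare two pages, one workaround (a sketch, not a fix for the server-side slowness; it assumes url1 and url2 are defined as above) is to issue both requests in parallel so the two multi-second waits overlap instead of adding up:

import urllib2
from threading import Thread

def fetch(url, results, key):
    # Each request still takes a few seconds on IMDB's end, but running
    # them in separate threads means we only wait for the slower one.
    results[key] = urllib2.urlopen(url).read()

results = {}
threads = [Thread(target=fetch, args=(url1, results, 'page1')),
           Thread(target=fetch, args=(url2, results, 'page2'))]
for t in threads:
    t.start()
for t in threads:
    t.join()

htmldata1, htmldata2 = results['page1'], results['page2']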