I am trying to read several websites, extract the information that I need, and then move on. However, the Python code hangs on some of them. I've noticed in real browsers that, at random times, a site fails to finish loading; perhaps it's waiting on ads to load?
The information I need is within the first 50 KB of the page. If I use a timeout, the entire response is lost in every module I have tried (urllib, urllib3, and pycurl). Also, in pycurl, setting the RANGE option does not seem to do anything for the URL.
Does anyone know how to keep the content that has already been received when a timeout occurs? Or does anyone know how to effectively limit the download to a certain number of bytes?
I found that pycurl still writes to the buffer up to the point of the timeout. If a timeout occurs, the error can be caught and the buffer's contents retrieved. Here is the code that I used:
import pycurl
from io import BytesIO

buffer = BytesIO()
http_curl = pycurl.Curl()
http_curl.setopt(pycurl.URL, url)
http_curl.setopt(pycurl.WRITEDATA, buffer)     # libcurl writes the body into buffer as it arrives
http_curl.setopt(pycurl.FOLLOWLOCATION, True)  # follow redirects
http_curl.setopt(pycurl.TIMEOUT_MS, 1000)      # abort the transfer after 1 second

try:
    http_curl.perform()
except pycurl.error:
    # The timeout raises pycurl.error, but whatever arrived before the
    # abort is already in the buffer.
    pass
finally:
    http_curl.close()

response = buffer.getvalue().decode('utf-8')
print(response)
The page was partially downloaded and then printed. Thanks to t.m.adam for inspiring this workaround.
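If you would rather cap the download at a fixed size instead of relying on a timeout, libcurl can abort a transfer from the write callback: pycurl treats a return value that differs from the length of the data as a write error. Below is a minimal sketch of that approach; url and the 50 KB cap MAX_BYTES are placeholders, and the final pycurl.error is expected once the cap is reached.

import pycurl
from io import BytesIO

MAX_BYTES = 50 * 1024  # assumed cap of 50 KB; adjust as needed
buffer = BytesIO()

def write_capped(data):
    # Store each chunk as it arrives; once the cap is reached, return 0
    # (fewer bytes than were passed in) so libcurl aborts the transfer.
    buffer.write(data)
    if buffer.tell() >= MAX_BYTES:
        return 0
    # Returning None tells pycurl the whole chunk was consumed.

curl = pycurl.Curl()
curl.setopt(pycurl.URL, url)
curl.setopt(pycurl.WRITEFUNCTION, write_capped)
curl.setopt(pycurl.FOLLOWLOCATION, True)
try:
    curl.perform()
except pycurl.error:
    pass  # raised when the callback aborts the transfer (or on a genuine error)
finally:
    curl.close()

response = buffer.getvalue()[:MAX_BYTES].decode('utf-8', errors='replace')

This avoids waiting for the timeout on pages that stall after the part you need has already arrived.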