I am trying to write a scraper in Python using requests with proxies to scrape an HTTPS page. I found lists of free proxies on the internet and manually validated a bunch of them in an online proxy checker. I also made sure to use only proxies that support HTTPS according to the website. But in Python nearly all of them fail for HTTP pages, and ALL of them fail for my desired HTTPS page. I did everything according to the tutorials I found and I am running out of ideas about what the issue could be. I plan to look into the actual error messages without the try/except today, but I hoped someone could tell me whether the code is valid in the first place.
def proxy_json_test_saved_proxies(self):
    test_count = 1
    timeout_seconds = 10
    working_http = 0
    working_https = 0
    for proxy_dict in self.all_proxies:
        print("#######")
        print("Testing http proxy " + str(test_count) + "/" + str(len(self.all_proxies)))
        test_count += 1
        proxy = {'http': 'http://' + proxy_dict["address"],
                 'https': 'https://' + proxy_dict["address"]
                 }
        print(proxy)
        print("Try http connection:")
        try:
            requests.get("http://example.com", proxies=proxy, timeout=timeout_seconds)
        except IOError:
            print("Fail")
        else:
            print("Success")
            working_http += 1
        print("Try https connection:")
        try:
            requests.get("https://example.com", proxies=proxy, timeout=timeout_seconds)
        except IOError:
            print("Fail")
        else:
            print("Success")
            working_https += 1
    print("Working http: ", working_http)
    print("Working https: ", working_https)
proxy_dict["address"] contains ip:port values like "185.247.177.27:80". self.all_proxies is a list of about 100 of those proxy_dicts.
I also know that these free proxies are often already occupied. So I repeated the routine multiple times, but NONE of them ever worked for https, and the http count did not really improve either.
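To see why each request fails rather than only that it failed, the except branch can keep the exception object and print its type and message. Below is a minimal sketch under the same requests-based setup; `describe_failure` and `try_request` are hypothetical helper names for illustration, not part of requests.

```python
def describe_failure(exc):
    """Return the exception class name and its message as one line."""
    return f"{type(exc).__name__}: {exc}"


def try_request(url, proxies, timeout=10):
    """Return None on success, or a one-line description of the failure."""
    # Third-party import kept local so describe_failure has no dependencies.
    import requests

    try:
        requests.get(url, proxies=proxies, timeout=timeout)
    except requests.exceptions.RequestException as exc:
        # ProxyError, SSLError, ConnectTimeout etc. all derive from
        # RequestException, so the printed class name tells them apart.
        return describe_failure(exc)
    return None
```

Calling `print(try_request("https://example.com", proxy))` inside the loop would then show the exception class for each proxy instead of a bare "Fail".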
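Me again. I solved the issue and wanted to post the answer. In the end it was just a typo in the proxy definition. The keys of the dict refer to the scheme of the target URL, while the scheme inside each value says how to connect to the proxy itself. The proxy server is reached via http, no matter whether the target URL uses http or https.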
I changed this:
proxy = {'http': 'http://' + proxy_dict["address"],
         'https': 'https://' + proxy_dict["address"]
         }
To this (deleted the "s" in https string):
proxy = {'http': 'http://' + proxy_dict["address"],
         'https': 'http://' + proxy_dict["address"]
         }
And now it works.
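For anyone who wants to reuse the fix, here is a small sketch of how it could be wrapped in a helper. The names `build_proxies` and `proxy_works` are my own (hypothetical), and it assumes the proxies accept plain HTTP connections, as mine did.

```python
def build_proxies(address):
    """Build a requests-style proxies dict from an "ip:port" string.

    Both keys point at the same http:// proxy URL: the key is matched
    against the scheme of the target URL, while the value's scheme is
    how the client connects to the proxy.
    """
    proxy_url = "http://" + address
    return {"http": proxy_url, "https": proxy_url}


def proxy_works(address, url="https://example.com", timeout=10):
    """Return True if a GET through the proxy completes without an error."""
    # Third-party import kept local so build_proxies has no dependencies.
    import requests

    try:
        requests.get(url, proxies=build_proxies(address), timeout=timeout)
    except requests.exceptions.RequestException:
        return False
    return True
```

With this, the loop body reduces to something like `if proxy_works(proxy_dict["address"]): working_https += 1`.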