I am trying to implement a proxy in a Python scraper.
However, it appears I cannot pass a proxies parameter to urlopen() as suggested in the tutorial I followed (probably a version difference?):
proxy = {'http' : 'http://example:8080' }
req = urllib.request.Request(Site,headers=hdr, proxies=proxy)
resp = urllib.request.urlopen(req).read()
So I tried to work it out from the documentation for urllib.request, which suggests creating an opener. However, the opener has no parameter for headers; instead, something like opener.addheaders = [] is suggested.
Nothing I tried worked (test prints of the proxy IPs do work).
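For what it's worth, an opener built with build_opener() also accepts a Request object directly, so a full header dict can travel on the request itself instead of through addheaders — a minimal sketch (the proxy address and target URL are placeholders):

```python
import urllib.request

proxy = {'http': 'http://example:8080'}  # placeholder proxy address
opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxy))

hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib.request.Request('http://example.com', headers=hdr)
# OpenerDirector.open() accepts Request objects, so the headers are preserved:
# resp = opener.open(req).read()
```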
The following constellation looks like best practice to me, but it throws a "cannot find file" error, and I am not sure why.
It would be nice if you could show me how to use the proxy together with a full set of headers.
Code:
import bs4 as bs
import urllib.request
import ssl
import re
from pprint import pprint ## for printing out a readable dict. can be deleted afterwards
#########################################################
## Parsing with beautiful soup
#########################################################
ssl._create_default_https_context = ssl._create_unverified_context
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'}
Site = 'https://example.com'
proxy = {'http' : 'http://example:8080' }
def openPage(Site, hdr):
    ## IP check
    print('Actual IP', urllib.request.urlopen('http://httpbin.org/ip').read())
    req = urllib.request.Request(Site, headers=hdr)
    opener = urllib.request.FancyURLopener(proxy)
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    ## IP check
    print('Fake IP', opener.open('http://httpbin.org/ip').read())
    resp = opener.open(req).read()
    ## soup = bs.BeautifulSoup(resp,'lxml')
    ## return(soup)

soup = openPage(Site,hdr)....
ERROR:
Traceback (most recent call last):
  File "C:\Program Files\Python36\lib\urllib\request.py", line 1990, in open_local_file
    stats = os.stat(localname)
FileNotFoundError: [WinError 2] The system cannot find the file specified: '<urllib.request.Request object at 0x000001D94816A908>'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Projects/Python/Programms/WebScraper/scraper.py", line 72, in <module>
    mainNav()
  File "C:/Projects/Python/Programms/WebScraper/scraper.py", line 40, in mainNav
    soup = openPage(Site,hdr,ean)
  File "C:/Projects/Python/Programms/WebScraper/scraper.py", line 32, in openPage
    resp = opener.open(req).read()
  File "C:\Program Files\Python36\lib\urllib\request.py", line 1762, in open
    return getattr(self, name)(url)
  File "C:\Program Files\Python36\lib\urllib\request.py", line 1981, in open_file
    return self.open_local_file(url)
  File "C:\Program Files\Python36\lib\urllib\request.py", line 1992, in open_local_file
    raise URLError(e.strerror, e.filename)
urllib.error.URLError: <urlopen error The system cannot find the file specified>
The following code was successful. I switched from FancyURLopener to installing my own opener built with a ProxyHandler for the proxy defined earlier; the headers are added afterwards. (The original error occurred because URLopener.open() expects a URL string — the Request object was stringified and, lacking a URL scheme, treated as a local file path, hence open_local_file in the traceback.)
def openPage(site, hdr, proxy):
    ## Create opener that routes requests through the proxy
    proxy_support = urllib.request.ProxyHandler(proxy)
    opener = urllib.request.build_opener(proxy_support)
    urllib.request.install_opener(opener)
    opener.addheaders = list(hdr.items())  ## addheaders expects a list of (name, value) tuples, not a dict
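Put together, a self-contained sketch of that working approach (the proxy address and target URL are placeholders, and the actual network call is commented out):

```python
import urllib.request

def build_proxy_opener(proxy, hdr):
    """Build an opener that routes traffic through `proxy` and sends `hdr` with every request."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxy))
    opener.addheaders = list(hdr.items())  # must be (name, value) tuples, not a dict
    return opener

proxy = {'http': 'http://example:8080'}   # placeholder proxy
hdr = {'User-Agent': 'Mozilla/5.0',
       'Accept-Language': 'en-US,en;q=0.8'}

opener = build_proxy_opener(proxy, hdr)
urllib.request.install_opener(opener)     # plain urlopen() now uses this opener
# resp = urllib.request.urlopen('http://example.com').read()
```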