I am trying to do some web scraping for a study project. Unfortunately, I need to scrape some data off Google Scholar, which blocks my requests. I have tried using (multiple) HTTP proxies, but my requests still get blocked after ~300 tries.
The resulting HTML from the blocked requests contains:
IP address: 145.109...<br/>Time: 2016-05-05T09:23:37Z<br/>URL: https://scholar.google.nl/citations?hl=en&view_op=search_authors&mauthors=Perry<br/>
The above IP is my own, while my proxies dict (it selects a proxy from a list at random) and get request look like this:
proxies = {'http': 'http://<username>:<password>@107.182....:<port>'}
result = requests.get(
    'https://scholar.google.nl/citations?hl=en&view_op=search_authors&mauthors=Perry',
    proxies=proxies, headers=headers)
The proxy IPs are of course valid and working, and my own IP is not included in the proxy list. Am I doing something wrong?
Edit: For completeness, I have also tried setting authentication like this answer suggests, but the result is the same.
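In your proxies dict, the URL scheme doesn't match the one you're using for your request: you have an http entry for your proxies but then make an https request. With no entry for the https scheme, the request goes out directly, which is why the block page shows your own IP. If you add a proxy for the https scheme, it should work.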
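A minimal sketch of the fix, assuming the same proxy can tunnel HTTPS traffic. PROXY_URL, build_proxies, proxy_for and fetch are hypothetical names used only for illustration, and the proxy address is a placeholder. proxy_for is a simplified picture of the scheme-keyed lookup (requests also supports more specific keys such as scheme://host), not its actual implementation.

```python
from urllib.parse import urlsplit

# Placeholder proxy address (an assumption); substitute a real one from your list.
PROXY_URL = "http://user:password@proxy.example.com:8080"


def build_proxies(proxy_url):
    """Return a proxies dict with an entry for both the http and https schemes."""
    return {"http": proxy_url, "https": proxy_url}


def proxy_for(url, proxies):
    """Simplified scheme lookup: return the proxy entry matching the URL's
    scheme, or None if the dict has no entry for that scheme."""
    return proxies.get(urlsplit(url).scheme)


def fetch(url, proxies, headers=None):
    """Send the request through the configured proxies."""
    import requests  # third-party; imported here so the helpers above run without it
    return requests.get(url, proxies=proxies, headers=headers, timeout=10)
```

With only an 'http' key, proxy_for returns None for the Scholar URL because that URL is https; with both keys, it returns the proxy. Calling fetch(url, build_proxies(PROXY_URL), headers) would then send the request through the proxy.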