I am trying to scrape https://www.rule34video.com/ using Python.
At first it worked with a simple requests.get(), but all subsequent attempts failed the next day. I did allow Windows to update in between; I'm not sure if that's the cause. I tried including headers:
import requests

url = 'https://www.rule34video.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
print(requests.get(url, headers=headers).text)
But this is what I get:
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='rule34video.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000025316C12430>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
Then I tried Selenium as a last resort, but the result was the same: it cannot access the website at all.
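For reference, this is roughly the Selenium code I ran (a stripped-down sketch; driver setup details are omitted and the URL is the same as above):

# Simplified sketch of the Selenium attempt; assumes Selenium 4 with Chrome
# resolved automatically by Selenium Manager.
from selenium import webdriver

options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)
driver.get('https://www.rule34video.com/')
print(driver.page_source)  # print the HTML that actually loaded
driver.quit()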
This is what I see on the loaded HTML page:
502 Bad Gateway
ProtocolException('Server connection to (\'rule34video.com\', 443) failed: Error connecting to "rule34video.com": [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond')
I am almost certain that my IP address is blacklisted; however, when I visit https://rule34video.com/ in Google Chrome, it loads with no problem at all.
My question is:
Websites have different ways to detect scrapers and bots.
After searching around, it seems I can get past these protections using the undetected (UC) mode of the SeleniumBase framework.
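This is roughly what I'm planning to try (a minimal sketch using SeleniumBase's SB() manager with uc=True; I haven't run it against the site yet):

# Minimal sketch of SeleniumBase's undetected (UC) mode; untested against this site.
from seleniumbase import SB

with SB(uc=True, headless=False) as sb:  # uc=True launches undetected-chromedriver
    sb.open('https://www.rule34video.com/')
    print(sb.get_page_source()[:500])  # print the start of the HTML to check access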