python, python-3.x, selenium-webdriver, web-scraping, python-requests

Scrapers blocked but not browser


I am trying to scrape https://www.rule34video.com/ using Python.

At first, it worked with a simple requests.get(). However, subsequent attempts the next day failed. I did let Windows update in between, though I'm not sure whether that is the cause. I then tried including headers:

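import requests

# Target page (the site named in the question)
url = 'https://www.rule34video.com/'

# Send a browser-like User-Agent instead of the default python-requests one
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
print(requests.get(url, headers=headers).text)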

But this is what I get:

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='rule34video.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000025316C12430>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

Then I tried Selenium as a last resort, but the result was the same: it can't access the website at all.

This is what I see on the loaded HTML page:

502 Bad Gateway

ProtocolException('Server connection to (\'rule34video.com\', 443) failed: Error connecting to "rule34video.com": [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond')

I am almost certain that my IP address is blacklisted. However, when I visit https://rule34video.com/ in Google Chrome, it loads with no problem at all.

My questions are:

  1. How does Google Chrome not get blocked?
  2. What can I do to bypass the scraping protection?

Solution

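  • Websites have different ways to detect scrapers and bots, so a request from a script can be rejected even when a normal browser on the same machine is allowed through.

    After some searching, I was able to get past these protections using UC Mode (Undetected-Chromedriver Mode) from the SeleniumBase framework: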

    https://seleniumbase.io/help_docs/uc_mode/
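
    A minimal sketch of what that could look like, assuming SeleniumBase is installed (pip install seleniumbase) and Chrome is available. The helper name fetch_page_source and the 4-second reconnect time are illustrative choices, not part of the original answer, and whether UC Mode gets through depends on how the site detects automation:

```python
def fetch_page_source(url, reconnect_time=4):
    """Open `url` in SeleniumBase UC Mode and return the page HTML.

    `fetch_page_source` and the default `reconnect_time` are hypothetical
    choices made for this sketch.
    """
    # Imported lazily so the helper can be defined without SeleniumBase loaded.
    from seleniumbase import SB

    # uc=True launches Chrome in UC Mode (undetected-chromedriver mode).
    with SB(uc=True) as sb:
        # Open the URL, keeping chromedriver disconnected from the browser
        # for `reconnect_time` seconds while the page loads.
        sb.uc_open_with_reconnect(url, reconnect_time)
        return sb.get_page_source()


if __name__ == "__main__":
    # URL taken from the question above.
    html = fetch_page_source("https://www.rule34video.com/")
    print(html[:500])  # show only the beginning of the document
```

    If the page still fails to load, raising reconnect_time is one thing to try, since it gives the site's checks more time to finish before the driver reconnects.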