selenium, webdriver, detection, tor, brute-force

How to avoid detection when accessing a website through Tor Browser with Selenium?


I have been scraping websites for a while now, and when you brute-force your way through 500,000+ URLs on a single website, you can get blocked. Therefore, I am now trying to scrape my data through Tor Browser with Selenium WebDriver. So far so good; I got it up and running:

from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
import os

# Start the Tor daemon that ships with Tor Browser
torexe = os.popen(r'C:/location_to/Tor Browser/Browser/TorBrowser/Tor/tor.exe')
profile = FirefoxProfile(r"C:/location_to/Tor Browser/Browser/TorBrowser/Data/Browser/Caches/profile.default")
# Route Firefox traffic through the local SOCKS proxy on 127.0.0.1:9050
profile.set_preference('network.proxy.type', 1)
profile.set_preference('network.proxy.socks', '127.0.0.1')
profile.set_preference('network.proxy.socks_port', 9050)
# False resolves DNS locally, outside the proxy; True would send lookups through it
profile.set_preference("network.proxy.socks_remote_dns", False)
profile.update_preferences()
driver = webdriver.Firefox(firefox_profile=profile, executable_path=r'C:/Location_to/geckodriver-v0.25.0-win64/geckodriver.exe')
# Confirm that the browser really goes through Tor
driver.get("http://check.torproject.org")

Resulting in: Congratulations. This browser is configured to use Tor. Your IP address appears to be: 94.230.208.147

Great. However, when I try to access certain websites I get detected:

from bs4 import BeautifulSoup as soup  # assumption: "soup" is this alias

driver.get("https://gearbest.com")
raw_html = driver.page_source
clean_html = soup(raw_html, 'html.parser')

Access Denied You don't have permission to access "http://gearbest.com/" on this server. Reference #18.cff31502.1569612654.932f460

Most websites do not detect me; only a handful do. I have tried a bunch of "solutions", but posting them would most likely be more confusing than helpful. It could be headless detection, but again, I am not sure. Who can help me here?

Thank you in advance.


Solution

  • A list of websites that block access through Tor is maintained here: https://trac.torproject.org/projects/tor/wiki/org/doc/ListOfServicesBlockingTor That page also lists ad-hoc workarounds, which involve fetching the content via other websites.

    For security reasons, I have switched to autoVPN (on Linux, in a VM), which is free, is not blocked by the target website, and provides high-end privacy.
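
    Whichever route is used, it can help to find out automatically which sites are refusing the connection. The following is only a minimal sketch, not a definitive implementation: the marker strings are taken from the "Access Denied" page quoted in the question, and the names BLOCK_MARKERS, is_block_page and split_blocked are hypothetical helpers introduced for illustration. Other sites will likely word their block pages differently.

```python
# Marker strings copied from the "Access Denied" page quoted in the question.
# Assumption: other sites may use different wording, so extend this as needed.
BLOCK_MARKERS = (
    "access denied",
    "you don't have permission to access",
)


def is_block_page(html: str) -> bool:
    """Return True if the page source looks like an access-denied page."""
    text = html.lower()
    return any(marker in text for marker in BLOCK_MARKERS)


def split_blocked(pages: dict) -> tuple:
    """Split a {url: page_source} mapping into (ok_urls, blocked_urls)."""
    ok, blocked = [], []
    for url, html in pages.items():
        (blocked if is_block_page(html) else ok).append(url)
    return ok, blocked
```

    With the driver from the question, this could be called as is_block_page(driver.page_source) right after driver.get(url), so that blocked URLs are collected and retried over a different connection instead of being stored as scraped data.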