I'm making a price-scraping program and have run into the issue of anti-scraping systems. I managed to get around these with undetected_chromedriver (UC), but now I'm running into two issues.
The first is that UC is significantly slower than the standard Chrome driver, though I need it for some sites, so I scrape some sites with a normal driver and others with UC.
The second problem is that I install the standard Chrome driver at the beginning of the program, but once I do that, UC seems to reinstall itself every time I open it. This causes some sites to be scraped really slowly. Can you help explain why that is? Any other tips for making the scraper run faster would be appreciated.
I have this run at the beginning of the program as global variables:
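from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

chrome_path = Service(ChromeDriverManager().install())
options = webdriver.ChromeOptions()
options.headless = True
options.add_experimental_option('excludeSwitches', ['enable-logging'])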
This runs as a function every time I need a UC driver:
import undetected_chromedriver as uc

def start_uc():
    options = webdriver.ChromeOptions()
    # just some options passed in to skip annoying popups (one flag per call)
    options.add_argument('--no-first-run')
    options.add_argument('--no-service-autorun')
    options.add_argument('--password-store=basic')
    driver = uc.Chrome(options=options)
    return driver
My scraping functions just loop over the URLs and scrape the info, restarting the driver to clear the cookies if I run into a captcha. The scraping functions look like this (this is pseudocode to give you an idea):
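driver = start_uc()
for url in url_list:
    while True:
        # scrape info; break out of the loop once it succeeds
        # if a captcha shows up, restart the driver to clear cookies:
        driver = start_uc()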
I don't see why chrome_path would affect UC. Are there any suggestions to make the scraping functions run more efficiently? I'm not an expert on drivers and their intricacies, so I could be doing something terribly wrong that I don't recognize.
Thank you in advance!
You can use https://github.com/seleniumbase/SeleniumBase to speed things up. (It has a special undetected-chromedriver mode that works with headless mode.)
pip install -U seleniumbase
Then run the following with Python:
from seleniumbase import Driver
from seleniumbase import page_actions

driver = Driver(headless=True, uc=True)
driver.get(url)  # url: address of the page to check (not given here)
page_actions.wait_for_text(driver, "OH YEAH, you passed!", "h1")
print(driver.find_element("css selector", "body").text)
screenshot_name = "now_secure_image.png"
driver.save_screenshot(screenshot_name)
print("\nScreenshot saved to: %s" % screenshot_name)
driver.quit()
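Independent of which library you use, a lot of time can go into starting browsers. Below is a minimal sketch, not a definitive implementation, of one way to start each kind of driver only once, reuse it across URLs, and restart it only when a captcha appears. The names `DriverPool`, `scrape_all`, `uc_hosts`, and the `scrape` callback are hypothetical; `scrape` is assumed to return `None` when it hits a captcha. The sketch only relies on the driver having a `quit()` method.

```python
from urllib.parse import urlparse


class DriverPool:
    """Create each kind of driver once, on first use, and reuse it afterwards."""

    def __init__(self, factories, uc_hosts):
        # factories: dict mapping a kind ("plain" or "uc") to a zero-argument
        # callable that returns a new driver (assumed interface).
        self._factories = factories
        self._uc_hosts = set(uc_hosts)
        self._drivers = {}

    def kind_for(self, url):
        # Only hosts listed in uc_hosts pay the cost of the slower UC driver.
        host = urlparse(url).hostname or ""
        return "uc" if host in self._uc_hosts else "plain"

    def get(self, url):
        kind = self.kind_for(url)
        if kind not in self._drivers:
            self._drivers[kind] = self._factories[kind]()
        return self._drivers[kind]

    def restart(self, url):
        # Quit the old browser before replacing it so processes do not pile up.
        kind = self.kind_for(url)
        old = self._drivers.pop(kind, None)
        if old is not None:
            old.quit()
        return self.get(url)

    def close(self):
        for driver in self._drivers.values():
            driver.quit()
        self._drivers.clear()


def scrape_all(urls, pool, scrape, max_restarts=2):
    """Scrape each URL; restart its driver only when scrape() returns None."""
    results = {}
    for url in urls:
        driver = pool.get(url)
        for attempt in range(max_restarts + 1):
            data = scrape(driver, url)  # hypothetical callback, None on captcha
            if data is not None:
                results[url] = data
                break
            if attempt < max_restarts:
                driver = pool.restart(url)
    return results
```

With the code from the question, the factories could be something like `{"plain": lambda: webdriver.Chrome(service=chrome_path, options=options), "uc": start_uc}`. Bounding the restarts with `max_restarts` also avoids the endless `while True` loop when a site keeps showing a captcha.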