Search code examples
pythonweb-scrapingrecaptcha

GoogleCaptcha roadblock in website scraper


I am currently working on a scraper for aniworld.to. My goal is it to enter the anime name and get all of the Episodes downloaded. I have everything working except one thing... The websites has a Watch button. That Button redirects you to https://aniworld.to/redirect/SOMETHING and that Site has a captcha which means the link is not in the html... Is there a way to bypass this/get the link in python? Or a way to display the captcha so I can solve it? Because the captcha only appears every lightyear. The only thing I need from that page is the redirect link. It looks like this: https://vidoza.net/embed-something.html My very very wip code is here if it helps: https://github.com/wolfswolke/aniworld_scraper


Solution

  • Mitchdu showed me how to do it. If anyone else needs help here is my code: https://github.com/wolfswolke/aniworld_scraper/blob/main/src/logic/captcha.py

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager
    from selenium.webdriver.support.ui import WebDriverWait
    from threading import Thread
    
    import os
    def open_captcha_window(full_url):
        working_dir = os.getcwd()
        path_to_ublock = r'{}\extensions\ublock'.format(working_dir)
        options = webdriver.ChromeOptions()
        options.add_argument("app=" + full_url)
        options.add_argument("window-size=423,705")
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        if os.path.exists(path_to_ublock):
            options.add_argument('load-extension=' + path_to_ublock)
    
        driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
        driver.get(full_url)
    
        wait = WebDriverWait(driver, 100, 0.3)
        wait.until(lambda redirect: redirect.current_url != full_url)
    
        new_page = driver.current_url
        Thread(target=threaded_driver_close, args=(driver,)).start()
        return new_page
    
    
    def threaded_driver_close(driver):
        driver.close()