web-scraping recaptcha captcha apify 2captcha

How do captcha solving services like 2captcha and Apify replicate my captcha internally using the data-sitekey?


As I understand from various blogs, sites like 2captcha are human-powered image and CAPTCHA recognition services. Their main purpose is to solve your CAPTCHAs quickly and accurately, using human employees who are always online to receive my captcha and solve it on their end.

Now let's take https://www.google.com/recaptcha/api2/demo as an example. Say a captcha is generated there. Services like 2captcha need the data-sitekey, which, as far as I understand, is generated for every captcha.

data-sitekey="6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-"

What I don't understand is how a captcha solver replicates or reproduces the captcha on their end using just the data-sitekey. Does Google provide a service for this?

How does the human on the other end receive the same captcha, solve it and send the result back?


Solution

  • This is quite late to answer, but it may still help somebody in the future.

    I had the same question and started analysing it. I went through several websites, blogs and research papers to find out how it works internally.

    Here is what I understood about how reCAPTCHA is implemented.

    1. The data-sitekey is associated with the website, not with an individual captcha. Before loading the captcha, Google verifies that the key is being used from the associated domain by checking document.location.hostname.
    2. When the user solves the reCAPTCHA, it generates a g-recaptcha-response token, which represents the captcha solution and is based on your browser history, google.com cookies and other browser data.
    3. The backend server then validates this token by calling the Google API and passing the secret key shared between Google and your website.
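
    Step 3 can be sketched as follows. This is a minimal sketch using only the Python standard library, not any particular site's backend. The endpoint and the `secret`, `response` and `remoteip` fields are the ones described in Google's siteverify documentation; the function names and the secret and token values passed in are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Google's verification endpoint for reCAPTCHA tokens.
SITEVERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"


def build_siteverify_request(secret_key, token, remote_ip=None):
    """Build the POST request a backend sends to Google (no network I/O)."""
    fields = {"secret": secret_key, "response": token}
    if remote_ip:
        fields["remoteip"] = remote_ip  # optional: the end user's IP address
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(SITEVERIFY_URL, data=body, method="POST")


def is_token_valid(response_body):
    """Interpret the JSON body returned by the siteverify endpoint."""
    result = json.loads(response_body)
    return bool(result.get("success"))


def verify_token(secret_key, token, remote_ip=None, timeout=10):
    """Send the request and return True if Google accepts the token."""
    request = build_siteverify_request(secret_key, token, remote_ip)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return is_token_valid(response.read().decode("utf-8"))
```

    The request building and the response parsing are kept separate from the network call so that each part can be checked without contacting Google. Note that the backend only sees the token, which is why a token produced elsewhere for the same sitekey and hostname can still pass this check.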

    How these captcha solver services work

    1. Take the data-sitekey and the website URL from the user.
    2. Create an HTML page that contains a reCAPTCHA with the user-provided data-sitekey.
    3. Update the hosts file by adding an entry that points the hostname of the user-provided website URL to 127.0.0.1.
    4. Serve this HTML page from any web server installed on the local machine and open it through the user-provided website URL, which now resolves to 127.0.0.1. This way, Google treats the request as coming from the valid website and generates the reCAPTCHA.
    5. Once this reCAPTCHA is solved, the g-recaptcha-response token is generated and stays valid for ~120 seconds. This token is then given back to the user for the remaining steps.
    6. The user has to insert this token into the textarea with the id g-recaptcha-response and then submit the page.
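
    Steps 2, 3 and 6 can be sketched in a few lines of Python. This is only an illustration under assumptions, not the code of any actual solver service: the function names are hypothetical, the page uses the standard `g-recaptcha` widget markup and Google's `api.js` loader, and the functions only build strings, so nothing here edits the hosts file or starts a web server.

```python
import html
import json
from urllib.parse import urlsplit

# Public script that renders the reCAPTCHA v2 widget.
RECAPTCHA_SCRIPT = "https://www.google.com/recaptcha/api.js"


def build_captcha_page(site_key):
    """Step 2: a minimal page that renders the widget for the given sitekey."""
    safe_key = html.escape(site_key, quote=True)
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        f'<head><script src="{RECAPTCHA_SCRIPT}" async defer></script></head>\n'
        f'<body><div class="g-recaptcha" data-sitekey="{safe_key}"></div></body>\n'
        "</html>\n"
    )


def build_hosts_entry(website_url):
    """Step 3: hosts-file line that points the site's hostname at localhost."""
    hostname = urlsplit(website_url).hostname
    if not hostname:
        raise ValueError(f"no hostname in URL: {website_url!r}")
    return f"127.0.0.1 {hostname}"


def build_injection_script(token):
    """Step 6: JavaScript that writes the token into the response textarea."""
    # json.dumps yields a safely quoted string literal for the token.
    return (
        'document.getElementById("g-recaptcha-response").value = '
        f"{json.dumps(token)};"
    )
```

    For the demo page from the question, `build_hosts_entry("https://www.google.com/recaptcha/api2/demo")` returns `127.0.0.1 www.google.com`. With Selenium, the string from `build_injection_script` could be passed to `driver.execute_script(...)` before submitting the form.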

    References

    I have explained how this works in my YouTube video "Selenium automation of a website having google recaptcha".

    The source code no longer exists on GitHub because I deleted my GitHub account. If I can recover it, I will add it to my GitLab repository (NiRRaNjAN RauT · GitLab).

    Research paper: "I'm not a human: Breaking the Google reCAPTCHA"

    Based on this knowledge, I have built my own captcha solver service, Fast Captcha Solver, at an affordable price.