node.js heroku puppeteer cloudflare

Bypass Cloudflare's captcha with headless chrome using puppeteer on Heroku


I'm trying to access a site with headless Chrome using Puppeteer on Heroku. My setup works when I run it locally on my machine, but when I deploy it to Heroku I get Cloudflare's captcha page instead.

I understand that Puppeteer comes with JavaScript enabled by default, and from what I've read the problem has nothing to do with that.

I'm using puppeteer-extra-plugin-stealth, random-useragent and viewport randomization, but nothing seems to work.

Could it be that puppeteer and/or chrome is adding extra stuff when running locally vs on Heroku?

Here's my setup:

const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
const randomUseragent = require('random-useragent');

const USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36';


let browser = await puppeteer.launch(
  { headless: true, executablePath: process.env.CHROME_BIN || null, args: [
    '--enable-features=NetworkService', '--no-sandbox', '--disable-setuid-sandbox', '--disable-dev-shm-usage'
  ], ignoreHTTPSErrors: true, dumpio: false}
);
let page = await browser.newPage();
const userAgent = randomUseragent.getRandom();
const UA = userAgent || USER_AGENT;

//Randomize viewport size
await page.setViewport({
    width: 1920 + Math.floor(Math.random() * 100),
    height: 3000 + Math.floor(Math.random() * 100),
    deviceScaleFactor: 1,
    hasTouch: false,
    isLandscape: false,
    isMobile: false,
});

await page.setUserAgent(UA);
await page.setJavaScriptEnabled(true);
await page.setDefaultNavigationTimeout(0);
await page.goto('https://external.site.example', { waitUntil: 'networkidle0' });

...
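One way to test the hypothesis that the two environments send different requests is to capture the outgoing request headers in each place and compare them. The following is only a minimal sketch: `diffHeaders` is a hypothetical helper, not part of Puppeteer, and the header objects are assumed to come from something like `page.on('request', req => console.log(req.headers()))` run once locally and once on Heroku.

```javascript
// Hypothetical helper: report every header whose value differs between two
// captured header objects (for example, one logged locally, one on Heroku).
function diffHeaders(local, remote) {
  const names = new Set([...Object.keys(local), ...Object.keys(remote)]);
  const diff = {};
  for (const name of [...names].sort()) {
    if (local[name] !== remote[name]) {
      // A missing header shows up as undefined on that side.
      diff[name] = { local: local[name], remote: remote[name] };
    }
  }
  return diff;
}
```

An empty result suggests the browser itself behaves the same in both environments, which leaves network-level differences such as the source IP address as the likelier cause.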

Solution

  • I managed to fix my issue by following Raphael PICCOLO's comment on how IP addresses might get detected. Nothing extra was being added or removed by my machine or by Heroku; the only difference was the IP address.

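    I routed the browser through a proxy. Passing my proxy URL straight to Chrome produced a net::ERR_NO_SUPPORTED_PROXIES error, so I used proxy-chain, whose anonymizeProxy() starts a local forwarding proxy and returns a URL that Chrome accepts.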

    My code ended up something like this:

    const puppeteer = require('puppeteer-extra');
    const StealthPlugin = require('puppeteer-extra-plugin-stealth');
    puppeteer.use(StealthPlugin());
    const randomUseragent = require('random-useragent');
    const proxyChain = require('proxy-chain');
    
    const USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36';
    
    const oldProxyUrl = process.env.PROXY_SERVER;
    const newProxyUrl = await proxyChain.anonymizeProxy(oldProxyUrl);
    
    let browser = await puppeteer.launch(
      { headless: true, executablePath: process.env.CHROME_BIN || null, args: [
        '--no-sandbox', '--disable-setuid-sandbox', `--proxy-server=${newProxyUrl}`
      ], ignoreHTTPSErrors: true, dumpio: false}
    );
    let page = await browser.newPage();
    const userAgent = randomUseragent.getRandom();
    const UA = userAgent || USER_AGENT;
    
    //Randomize viewport size
    await page.setViewport({
        width: 1920 + Math.floor(Math.random() * 100),
        height: 3000 + Math.floor(Math.random() * 100),
        deviceScaleFactor: 1,
        hasTouch: false,
        isLandscape: false,
        isMobile: false,
    });
    
    await page.setUserAgent(UA);
    await page.setJavaScriptEnabled(true);
    await page.setDefaultNavigationTimeout(0);
    await page.goto('https://external.site.example', { waitUntil: 'networkidle0' });
    
    ...
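As far as I know, Chrome's `--proxy-server` flag does not accept credentials embedded in the proxy URL, which is the usual reason proxy-chain is needed. The sketch below shows one way to check for that case before launching. Both function names are hypothetical, and the only API used is Node's built-in `URL` class.

```javascript
// Hypothetical helper: true when the proxy URL carries a username or password,
// i.e. when it should go through proxyChain.anonymizeProxy() first.
function proxyHasCredentials(proxyUrl) {
  const { username, password } = new URL(proxyUrl);
  return username !== '' || password !== '';
}

// Hypothetical helper: build Chrome launch args for a credential-free proxy URL.
function buildLaunchArgs(proxyUrl) {
  if (proxyHasCredentials(proxyUrl)) {
    throw new Error('Proxy URL contains credentials; anonymize it first');
  }
  return ['--no-sandbox', '--disable-setuid-sandbox', `--proxy-server=${proxyUrl}`];
}
```

With the code above, `buildLaunchArgs(newProxyUrl)` would replace the inline `args` array. Once the browser is closed, proxy-chain's `closeAnonymizedProxy(newProxyUrl, true)` can be called to shut down the local forwarder.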