
Apify: Preserve headers in RequestQueue


I'm trying to crawl our local Confluence installation with the PuppeteerCrawler. My strategy is to log in first, then extract the session cookies and use them in the headers of the start URL. The code is as follows:

First, I log in "by foot" to extract the relevant credentials:

const Apify = require("apify");

const browser = await Apify.launchPuppeteer({ slowMo: 500 });
const page = await browser.newPage();
await page.goto('https://mycompany/confluence/login.action');

await page.focus('input#os_username');
await page.keyboard.type('myusername');
await page.focus('input#os_password');
await page.keyboard.type('mypasswd');
await page.keyboard.press('Enter');
await page.waitForNavigation();

// Get cookies and close the login session
const cookies = await page.cookies();
await browser.close();
const cookie_jsession = cookies.find(cookie => cookie.name === "JSESSIONID");
const cookie_crowdtoken = cookies.find(cookie => cookie.name === "crowd.token_key");
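As an aside, rather than fishing out individual cookies, the whole array returned by `page.cookies()` can be serialized into a `Cookie` header in one pass. A minimal sketch, using placeholder cookie values:

```javascript
// Sketch: build a Cookie header string from all cookies at once.
// The cookie values here are placeholders, not real session tokens.
const cookies = [
    { name: 'JSESSIONID', value: 'abc123' },
    { name: 'crowd.token_key', value: 'xyz789' },
];

const cookieHeader = cookies
    .map(({ name, value }) => `${name}=${value}`)
    .join('; ');
// cookieHeader is now "JSESSIONID=abc123; crowd.token_key=xyz789"
```

This keeps the header correct even if the site later adds or renames session cookies.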

Then I build up the crawler structure with the prepared request headers:

const startURL = {
    url: 'https://mycompany/confluence/index.action',
    method: 'GET',
    headers:
    {
        Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7',
        Cookie: `${cookie_jsession.name}=${cookie_jsession.value}; ${cookie_crowdtoken.name}=${cookie_crowdtoken.value}`,
    }
}

const requestQueue = await Apify.openRequestQueue();
await requestQueue.addRequest(new Apify.Request(startURL));
const pseudoUrls = [ new Apify.PseudoUrl('https://mycompany/confluence/[.*]')];

const crawler = new Apify.PuppeteerCrawler({
    launchPuppeteerOptions: { headless: false, slowMo: 500 },
    requestQueue,
    handlePageFunction: async ({ request, page }) => {

        const title = await page.title();

        console.log(`Title of ${request.url}: ${title}`);
        console.log(await page.content());

        await Apify.utils.enqueueLinks({
            page,
            selector: 'a:not(.like-button)',
            pseudoUrls,
            requestQueue
        });

    },
    maxRequestsPerCrawl: 3,
    maxConcurrency: 10,
});

await crawler.run();

The by-foot login and cookie extraction seem to work (the "curlified" request succeeds), but Confluence doesn't accept the login via Puppeteer / headless Chromium. It seems like the headers are getting lost somewhere.

What am I doing wrong?


Solution

  • Without first going into the details of why the headers don't work, I would suggest defining a custom gotoFunction in the PuppeteerCrawler options, such as:

    {
        // ...
        gotoFunction: async ({ request, page }) => {
            await page.setCookie(...cookies); // From page.cookies() earlier.
            return page.goto(request.url, { timeout: 60000 })
        }
    }
    

    This way, you don't need to do the parsing and the cookies will automatically be injected into the browser before each page load.
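For reference, the objects returned by `page.cookies()` already have the shape that `page.setCookie()` accepts, so they can be replayed verbatim. A sketch with hypothetical values:

```javascript
// Hypothetical cookie objects in the shape returned by page.cookies().
// page.setCookie() accepts this same shape, so no parsing is needed.
const cookies = [
    { name: 'JSESSIONID', value: 'abc123', domain: 'mycompany', path: '/', httpOnly: true },
    { name: 'crowd.token_key', value: 'xyz789', domain: 'mycompany', path: '/', httpOnly: false },
];

// Inside the custom gotoFunction, before navigating:
// await page.setCookie(...cookies);
```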

    As a note, modifying default request headers when using a headless browser is not a good practice, because it may lead to blocking on some sites that match received headers against a list of known browser fingerprints.

    Update:

    The below section is no longer relevant, because you can now use the Request class to override headers as expected.
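With that fix, a start request carrying the session cookie can be expressed as a plain object and passed to `requestQueue.addRequest()`. A sketch, with placeholder cookie values:

```javascript
// Sketch: per-request headers are honored, so the session cookie
// can ride along on the start request (placeholder values below).
const startRequest = {
    url: 'https://mycompany/confluence/index.action',
    headers: {
        Cookie: 'JSESSIONID=abc123; crowd.token_key=xyz789',
    },
};

// With the Apify SDK available:
// await requestQueue.addRequest(new Apify.Request(startRequest));
```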

    The headers problem is a complex one involving request interception in Puppeteer; see the related GitHub issue in the Apify SDK. Unfortunately, overriding headers via a Request object did not work in PuppeteerCrawler at the time, which is why you were unsuccessful.