Tags: typescript, web-scraping, playwright, opensea

Playwright works in headful mode but fails in headless


I'm trying this sample to get the number of offers an NFT has on OpenSea:

import { test, expect } from '@playwright/test';

test('test', async ({ page }) => {
    await page.goto('https://opensea.io/assets/ethereum/0x63217dbb73e7a02c1d30f486e899ee66d0aa5e0b/6341');
    await page.waitForLoadState('networkidle');

    let selector = page.locator("[id='Body offers-panel'] li");
    const offers = await selector.count();

    console.log('Num of offers:', offers);
});

Then I run "npx playwright test", which always prints "Num of offers: 0".

But if I run it in --headed mode, it works perfectly and outputs "Num of offers: 5".

Can anyone explain this or help me understand it?

I tried waiting on the locator:

let selector = page.locator("[id='Body offers-panel'] li").waitFor();

I tried waiting until all requests are done:

await page.waitForLoadState('networkidle');

I tried waiting for the first matching element:

let selector = page.locator("[id='Body offers-panel'] li").first().waitFor();

But none of these worked. I always get a count of 0 unless I run the test in --headed mode, no matter which NFT address I try.

I would like to solve this or understand why it happens.


Solution

  • Headless mode makes it more obvious to servers that your script is a bot. You're being detected and blocked headlessly, but bypassing detection when running headfully.

    Since you can't see anything, headless is a bit harder to debug than headful. Dumping the page's HTML and taking a screenshot with

    console.log(await page.content());
    await page.screenshot({path: "test.png", fullPage: true});
    

    is a good way to figure out why elements you expect to be on the page aren't there.

    In this case, adding

    const text = (await page.textContent("body"))
      .replace(/ +/g, " ")
      .replace(/(\n ?)+/g, "\n")
      .trim();
    console.log(text);
    

    after goto to get the full text content of the page gives:

    Access denied
    Error code 1020
    You do not have access to <Your URL>.The site owner may have set restrictions that prevent you from accessing the site.
    Error details
    Provide the site owner this information.
    I got an error when visiting <Your URL>.
    Error code: 1020
    Ray ID: **************
    Country: US
    Data center: *****
    IP: *****************
    Timestamp: 2023-02-17 22:39:13 UTC
    Click to copy
    Was this page helpful?
    Yes
    No
    Thank you for your feedback!
    Performance & security by Cloudflare
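
A check for this block page can be automated so that a script fails loudly instead of silently reporting zero offers. The following is a minimal sketch, assuming the block page keeps the wording shown in the output above. The helper name `looksLikeCloudflareBlock` is hypothetical, and the matched strings are taken only from that output, so other sites or error codes may need different patterns:

```typescript
// Hypothetical helper: heuristically detects the Cloudflare "Access denied"
// page from a page's text content. The patterns are assumptions based on the
// output shown above, not a documented Cloudflare contract.
function looksLikeCloudflareBlock(pageText: string): boolean {
  // Collapse all whitespace and lowercase so line breaks don't matter.
  const normalized = pageText.replace(/\s+/g, " ").toLowerCase();
  const accessDenied = normalized.includes("access denied");
  const errorCode = /error code:? 1020/.test(normalized);
  return accessDenied && errorCode;
}

// Possible usage inside a Playwright script (not executed here):
//   const text = (await page.textContent("body")) ?? "";
//   if (looksLikeCloudflareBlock(text)) {
//     throw new Error("Blocked by Cloudflare (error 1020)");
//   }
```

Requiring both phrases keeps the check from firing on ordinary pages that merely mention "access denied" somewhere in their content.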
    

    It's not a perfect guarantee, but adding a user agent header is an easy option that seems to be enough to avoid headless detection on this particular site at this point in time:

    import {expect, test} from "@playwright/test"; // ^1.42.1
    
    const url = "<Your URL>";
    const userAgent =
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36";
    
    test.describe("with user agent", () => {
      test.use({userAgent});
    
      test("is able to retrieve offers", async ({page}) => {
        await page.goto(url, {waitUntil: "commit"});
        const offers = page.locator('[id="Body offers-panel"] li');
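        // Retry until at least 10 offers have rendered; lower the threshold
        // for listings that have fewer offers.
        await expect(async () => {
          expect(await offers.count()).toBeGreaterThanOrEqual(10);
        }).toPass();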
      });
    });
    

    This works because the default user agent of headless Chromium contains the string "HeadlessChrome", which effectively announces "I am a robot", while headful uses a normal browser user agent. But keep in mind that bot detection involves many more factors than just this. Using a rotating proxy service would be a more robust solution.
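
To illustrate the user agent difference, here is a minimal sketch that derives a normal-looking user agent from a headless one instead of hardcoding an old Chrome version. It assumes the headless Chromium user agent differs from the headful one only by the token "HeadlessChrome", which may not hold for every browser version. The function name `stripHeadlessMarker` and the sample string are hypothetical, for illustration only:

```typescript
// Hypothetical helper: rewrites the "HeadlessChrome/<version>" token to
// "Chrome/<version>" and leaves the rest of the user agent untouched.
// Assumption: that token is the only headless marker in the string.
function stripHeadlessMarker(userAgent: string): string {
  return userAgent.replace(/HeadlessChrome\//g, "Chrome/");
}

// Illustrative sample only; a real value could be read in a script with
//   await page.evaluate(() => navigator.userAgent)
const sampleHeadlessUserAgent =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/122.0.0.0 Safari/537.36";

console.log(stripHeadlessMarker(sampleHeadlessUserAgent));
```

The result could then be passed as the `userAgent` option in the same way as the hardcoded strings in the examples here, which keeps the advertised browser version in line with the browser actually running.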

    For completeness, here's how to change the user agent in a non-test Playwright script, which is more typical in scraping:

    const playwright = require("playwright"); // ^1.42.1
    
    let browser;
    let context;
    (async () => {
      const url = "<Your URL>";
      const userAgent =
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36";
      // Launch Chromium so the engine matches the Chrome user agent above.
      browser = await playwright.chromium.launch();
      context = await browser.newContext({userAgent, bypassCSP: true});
      const page = await context.newPage();
      await page.goto(url, {waitUntil: "commit"});
      const sel = '[id="Body offers-panel"] li';
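      // Wait until at least 10 offers have rendered; lower the threshold
      // for listings that have fewer offers.
      await page.waitForFunction(
        `document.querySelectorAll('${sel}').length >= 10`
      );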
      const offers = await page.locator(sel).count();
      console.log("Num of offers:", offers);
    })()
      .catch(err => console.error(err))
      .finally(() => browser?.close());