I am trying to save the static HTML content of a page. However, what gets captured is dynamic content, such as scripts. Is there a way to capture the raw content?
Here is the sample code:
import { chromium } from 'playwright'; // Web scraper library
import * as fs from 'fs';
(async function () {
  const chromeBrowser = await chromium.launch({ headless: true }); // Chromium launch and options
  const context = await chromeBrowser.newContext({
    ignoreHTTPSErrors: true,
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
  });
  const page = await context.newPage();
  await page.goto("https://emposedesigns.wixsite.com/empose/games", { waitUntil: 'networkidle', timeout: 60000 });
  let content = await page.content();
  fs.writeFileSync('test.html', content);
  console.log("done");
  await chromeBrowser.close(); // close the browser so the Node process can exit
})();
When web scraping, after determining what your goal is, it's important to think about how you'd achieve the goal as a normal visitor to the site. Although some shortcuts exist (and are usually taken for web scraping, but not for testing), for the most part, Playwright is designed to replicate the user's actions 1:1.
The goal here is to get the text of the privacy policy. If we navigate to the page as a user, no such privacy policy is visible. It's possible that the policy is in the HTML statically. We can check that by viewing page source, but in this case it's not present.
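The "view page source" check can also be scripted. Here's a minimal sketch, assuming Node 18+ (for the built-in `fetch`). The helper names `staticHtmlContains` and `fetchStaticHtml` and the command-line arguments are illustrative, not part of the original question:

```javascript
// Hypothetical helper: does the given text appear in an HTML string
// (case-insensitive)? This matches anywhere in the markup, including
// inside <script> tags, so treat a hit as a hint rather than proof.
function staticHtmlContains(html, needle) {
  return html.toLowerCase().includes(needle.toLowerCase());
}

// Fetch the raw HTML the server returns, before any JavaScript runs.
async function fetchStaticHtml(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text();
}

// Usage: node check-static.js <url> <text to look for>
const [url, needle] = process.argv.slice(2);
if (url && needle) {
  fetchStaticHtml(url)
    .then(html => console.log(staticHtmlContains(html, needle)))
    .catch(err => console.error(err));
}
```

If this logs `false` for a phrase you expect in the policy, the text is most likely injected after load, and a plain HTTP request won't be enough.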
The policy is shown after clicking a link that has the text "Privacy Policy". After the browser renders the change triggered by the click, there's an iframe that contains the policy.
Here's one way to replicate this in Playwright:
const fs = require("node:fs/promises");
const playwright = require("playwright"); // ^1.30.1
const url = "<Your URL>";
let browser;
(async () => {
  browser = await playwright.chromium.launch();
  const page = await browser.newPage();
  await page.goto(url, {waitUntil: "domcontentloaded"});
  await page.getByText("Privacy Policy").click();
  const text = await page.frameLocator("iframe")
    .locator('[data-custom-class="body"]')
    .textContent(); // or .innerHTML()
  console.log(text.trim());
  await fs.writeFile("policy.txt", text.trim());
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
Now, if the goal is to get the privacy policy as quickly as possible, and you don't care about replicating user actions for testing purposes, you could navigate directly to the iframe's src URL. Assuming that URL is stable, this is the easiest way to get to the result: no clicking or iframes required.
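Here's a rough sketch of that shortcut. It assumes the policy iframe is the first iframe on the page after the click, and that the policy body still matches `[data-custom-class="body"]` when the iframe's document is loaded on its own. I haven't verified either for this site. The function names are illustrative, and Playwright is required lazily only so the URL helper can be used without it:

```javascript
// Resolve an iframe's src (which may be relative) against the page URL.
function resolveFrameUrl(src, pageUrl) {
  return new URL(src, pageUrl).href;
}

// One-time discovery: click through as a user would and read the iframe's src.
async function findPolicyFrameUrl(page, pageUrl) {
  await page.goto(pageUrl, {waitUntil: "domcontentloaded"});
  await page.getByText("Privacy Policy").click();
  const src = await page.locator("iframe").first().getAttribute("src");
  if (!src) {
    throw new Error("iframe has no src attribute");
  }
  return resolveFrameUrl(src, pageUrl);
}

// Fast path: navigate straight to the iframe's URL, no clicking needed.
async function scrapePolicy(page, frameUrl) {
  await page.goto(frameUrl, {waitUntil: "domcontentloaded"});
  const text = await page.locator('[data-custom-class="body"]').textContent();
  return (text ?? "").trim();
}

async function main(pageUrl) {
  const playwright = require("playwright"); // loaded only when actually scraping
  const browser = await playwright.chromium.launch();
  try {
    const page = await browser.newPage();
    const frameUrl = await findPolicyFrameUrl(page, pageUrl);
    console.log(frameUrl);
    console.log(await scrapePolicy(page, frameUrl));
  } finally {
    await browser.close();
  }
}

// Usage: node policy.js <page url>
if (process.argv[2]) {
  main(process.argv[2]).catch(err => console.error(err));
}
```

Once you know the iframe's URL, you can hardcode it and call only `scrapePolicy`, keeping `findPolicyFrameUrl` around as a fallback in case the URL changes.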