javascript, xpath, puppeteer

How to find child links at any level


I have the following piece of HTML inside a page I loaded using Puppeteer, and I'm trying to get all of the child links (not just direct children, but children at any level).

<ul class="ptf">
    <li class="pti">
        <div data-testid="pagetree-item-expander" class="pe" role="button" tabindex="0" aria-expanded="false"></div>
       <a href="/jsw/docs/start” data-testid="atlas_link">kl</a>
       <ul class="ptf" style="display:none">
            <li class="pti">
                <a href="/jsw/docs/what/" data-testid="atlas_link">ij</a>
            </li>
            <li class="pti">
                <a href="/jsw/docs/where/" data-testid="atlas_link">gh</a>
            </li>
            <li class="pti">
                <a href="/jsw/docs/common/" data-testid="atlas_link">ef</a>
            </li>

        </ul>
     <li class="pti">
         <div data-testid="pagetree-item-expander" class="pe" role="button" tabindex="0" aria-expanded="false"></div>
       <a href="/jsw/docs/ge/" data-testid="atlas_link">cd</a>
       <ul class="ptf" style="display:none">
            <li class="pti">
                <a href="/jsw/docs/wha/" data-testid="atlas_link">ab</a>
            </li>
      </li>
</ul>

I tried the following, but it doesn't return any children. What am I doing wrong?

const links = await page.$x("//*[@id=\"root\"]/div[2]/div/li[5]/ul//a");

for (let i = 0; i < links.length; i++) {
  const textContent = await links[i].getProperty("href");
  const srcText = await textContent.jsonValue();
  console.log(srcText);
}

Context: I'm looking to get the URLs of all child links under the Advanced Roadmaps section of the page's navigation tree.

Expected outcome: a flat array whose first 10 URLs are:

[
  "https://support.atlassian.com/jira-software-cloud/docs/get-started-with-advanced-roadmaps/",
  "https://support.atlassian.com/jira-software-cloud/docs/what-is-advanced-roadmaps/",
  "https://support.atlassian.com/jira-software-cloud/docs/where-do-i-find-advanced-roadmaps/",
  "https://support.atlassian.com/jira-software-cloud/docs/common-jira-software-configurations-for-advanced-roadmaps/",
  "https://support.atlassian.com/jira-software-cloud/docs/view-a-sample-advanced-roadmaps-plan/",
  "https://support.atlassian.com/jira-software-cloud/docs/create-a-new-plan-in-advanced-roadmaps/",
  "https://support.atlassian.com/jira-software-cloud/docs/how-do-i-navigate-advanced-roadmaps/",
  "https://support.atlassian.com/jira-software-cloud/docs/change-your-advanced-roadmaps-plan-settings/",
  "https://support.atlassian.com/jira-software-cloud/docs/how-do-i-read-my-advanced-roadmaps-plan/",
  "https://support.atlassian.com/jira-software-cloud/docs/what-do-the-symbols-in-advanced-roadmaps-mean/"
]

Solution

  • This appears to be an XY problem. The data is in the page source as a JSON string, so you can get it without any dependencies or imports by using Node 18's native fetch:

    fetch("<Your URL>")
      .then(res => {
        if (!res.ok) {
          throw Error(res.statusText);
        }
    
        return res.text();
      })
      .then(html => {
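        // the nav tree is embedded in the page source as a "pageTree: ..." line;
        // pull that JSON literal out of the raw HTML and parse it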
        const pageTree = JSON.parse(
          html.match(/^ *pageTree: (.*);*$/m)[1]
        );
        console.log(JSON.stringify(pageTree, null, 2));
        const hrefs = pageTree
          .find(({title}) =>
            title.toLowerCase().includes("advanced roadmaps")
          )
          .childList[0].childList.map(({slug}) => slug);
        console.log(hrefs);
      })
      .catch(err => console.error(err));
    

    Output:

    <giant JSON structure with the entire nav tree>
    [
      '/jira-software-cloud/docs/what-is-advanced-roadmaps/',
      '/jira-software-cloud/docs/where-do-i-find-advanced-roadmaps/',
      '/jira-software-cloud/docs/common-jira-software-configurations-for-advanced-roadmaps/',
      '/jira-software-cloud/docs/view-a-sample-advanced-roadmaps-plan/',
      '/jira-software-cloud/docs/create-a-new-plan-in-advanced-roadmaps/',
      '/jira-software-cloud/docs/how-do-i-navigate-advanced-roadmaps/',
      '/jira-software-cloud/docs/change-your-advanced-roadmaps-plan-settings/',
      '/jira-software-cloud/docs/how-do-i-read-my-advanced-roadmaps-plan/',
      '/jira-software-cloud/docs/what-do-the-symbols-in-advanced-roadmaps-mean/',
      '/jira-software-cloud/docs/what-keyboard-shortcuts-are-available-in-advanced-roadmaps/',
      '/jira-software-cloud/docs/add-teams-and-releases-to-your-advanced-roadmaps-plan/',
      '/jira-software-cloud/docs/build-out-your-plan-in-advanced-roadmaps/',
      '/jira-software-cloud/docs/planning-tools-in-advanced-roadmaps/',
      '/jira-software-cloud/docs/create-different-views-of-your-advanced-roadmaps-plan/',
      '/jira-software-cloud/docs/how-ted-uses-advanced-roadmaps-scenarios-and-capacity/',
      '/jira-software-cloud/docs/how-veronica-uses-advanced-roadmaps-cross-project-planning/'
    ]
    

    This runs in a fraction of the time Puppeteer would take, 0.879s on my decade-old laptop. Although it's possible the JSON format could change at any time, it's just as likely that the DOM could as well.

    See this answer for a detailed walkthrough of how to find your data like this. It's written in Python but all of the concepts apply to Node.

    If your requests are being blocked even after adding a user agent header, or you really want/need to use Puppeteer for some reason, the data in question is attached to the window, so you can use:

    const puppeteer = require("puppeteer"); // ^20.2.0
    
    const url = "<Your URL>";
    
    let browser;
    (async () => {
      browser = await puppeteer.launch();
      const [page] = await browser.pages();
      await page.setRequestInterception(true);
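      // allow only the main document request and abort everything else,
      // which speeds up the load considerably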
      page.on("request", req => {
        req.url().replace(/\/$/, "") === url.replace(/\/$/, "")
          ? req.continue()
          : req.abort();
      });
      await page.goto(url, {waitUntil: "domcontentloaded"});
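      // the pageTree data is attached to the window by the page itself,
      // so it's available even though all other requests were aborted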
      const hrefs = await page.evaluate(() =>
        window.__APP_INITIAL_STATE__.pageTree
          .at(-1)
          .childList[0].childList.map(({slug}) => slug)
      );
      console.log(hrefs);
    })()
      .catch(err => console.error(err))
      .finally(() => browser?.close());
    

    This took 3-4x as long to run as the fetch version for me.
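
    As for the literal question in the title: the //a step in the original XPath already matches descendants at any depth, so if it returns nothing, the absolute prefix (//*[@id="root"]/div[2]/div/li[5]/ul) most likely doesn't match anything, or the tree isn't rendered yet. For reference, here's a minimal sketch against the static markup from the question, showing that a plain descendant selector picks up links at every level:

    const puppeteer = require("puppeteer"); // ^20.2.0

    // markup copied from the question, trimmed to a couple of levels
    const html = `
    <ul class="ptf">
      <li class="pti">
        <a href="/jsw/docs/start" data-testid="atlas_link">kl</a>
        <ul class="ptf" style="display:none">
          <li class="pti"><a href="/jsw/docs/what/" data-testid="atlas_link">ij</a></li>
          <li class="pti"><a href="/jsw/docs/where/" data-testid="atlas_link">gh</a></li>
        </ul>
      </li>
    </ul>`;

    let browser;
    (async () => {
      browser = await puppeteer.launch();
      const [page] = await browser.pages();
      await page.setContent(html);
      // "ul.ptf a" matches <a> elements at any depth under the list,
      // not just direct children
      const hrefs = await page.$$eval('ul.ptf a[data-testid="atlas_link"]', els =>
        els.map(el => el.getAttribute("href"))
      );
      console.log(hrefs); // [ '/jsw/docs/start', '/jsw/docs/what/', '/jsw/docs/where/' ]
    })()
      .catch(err => console.error(err))
      .finally(() => browser?.close());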

    Disclosure: I'm the author of the linked blog post.