Pulling specific column's content from website table


I am trying to pull all passwords from the table at https://www.passwordrandom.com/most-popular-passwords. I only want the second td in each tr, skipping the first tr. When I run the code, every entry in the array turns out null.

I have tried adjusting the selector, but I'm not sure what needs to change. I suspect the arguments are wrong, but I don't know what they should look like.

const puppeteer = require('puppeteer')
const fs = require('fs')

const baseURL = 'https://www.passwordrandom.com/most-popular-passwords'

async function scrape() {
    const browser = await puppeteer.launch()

    const page = await browser.newPage()
    console.log('Puppeteer Initialized')

    await page.goto(baseURL)

    const allNodes = await page.evaluate(() => {
        return document.querySelectorAll("#cntContent_lstMain tr:not(:first-child) td:nth-child(2)")
    })

    const allWords = []

    for (let row in allNodes)
        allWords.push(allNodes[row].textContent)

    console.log(allWords)

    await browser.close();
}

scrape()

Essentially, the result should be an array containing every password in the table. The passwords are held in the second td of each tr, except for the first tr (as stated above).


Solution

  • The code inside page.evaluate runs in the browser; the code outside it runs in Node.

    document.querySelectorAll returns a NodeList. When you return it from page.evaluate, the value is serialized to cross from the browser to Node, and the DOM elements do not survive that step (they typically arrive as empty objects, not live nodes). As a result, allNodes[row].textContent does not give you the text.

    The easiest fix is to read the text inside page.evaluate and return only that, since plain strings serialize without loss.

    const allNodes = await page.evaluate(() => {
      const elements = [...document.querySelectorAll("#cntContent_lstMain tr:not(:first-child) td:nth-child(2)")]
      return elements.map(element=>element.textContent)
    })
    

    This returns an array containing the textContent of every element that matches the selector.
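
To see why returning the nodes loses the text while returning strings does not, here is a rough analogy that runs in plain Node with no browser. It is only a sketch: FakeNode is a hypothetical stand-in for a DOM element, and the JSON round trip is an assumed approximation of the serialization step, not Puppeteer's actual mechanism.

```javascript
// Hypothetical stand-in for a DOM element: like real DOM nodes, it exposes
// textContent through a getter on the prototype, not as an own data property.
class FakeNode {
  #text
  constructor(text) {
    this.#text = text
  }
  get textContent() {
    return this.#text
  }
}

// Assumption: a JSON round trip approximates what happens to a value
// returned from page.evaluate on its way from the browser to Node.
const roundTrip = (value) => JSON.parse(JSON.stringify(value))

const nodes = [new FakeNode('123456'), new FakeNode('password')]

// Returning the nodes themselves: each one arrives as an empty object,
// so reading textContent on the Node side yields undefined.
const lostNodes = roundTrip(nodes)
console.log(lostNodes[0].textContent)

// Returning plain strings, as the map() version does: the data survives.
const keptText = roundTrip(nodes.map((node) => node.textContent))
console.log(keptText)
```

The same reasoning applies to the selector in the question: map each matched element to its textContent before the value leaves page.evaluate.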