Tags: javascript, node.js, npm, heroku, puppeteer

Puppeteer only runs three times on Heroku


I'm working on a website that uses Puppeteer to scrape data from another website. When I run the npm server on my local machine, it scrapes the data just fine; however, when I deploy it to Heroku, it only gets through the first three classes I'm looking for and then stops.

I'm essentially trying to scrape data about classes from my school website, so I run this line in a for loop:

    let data = await crawler.scrapeData(classesTaken[i].code)
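
For context, the surrounding loop is roughly this shape (a simplified sketch; classesTaken is an array of objects with a code field):

    for (let i = 0; i < classesTaken.length; i++) {
        let data = await crawler.scrapeData(classesTaken[i].code)
        // ...store the returned { title, desc } for this class...
    }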

This calls the scrapeData() function in my crawler.js, shown below. I have replaced the actual website URL for my own privacy.

    // crawler.js
    const puppeteer = require('puppeteer')

    module.exports = {
        scrapeData: async (code) => {
            const browser = await puppeteer.launch({
                args: [
                    '--no-sandbox',
                    '--disable-setuid-sandbox'
                ]
            })
            const page = await browser.newPage()

            // Search for the class code
            await page.goto("website url")
            await page.type('#crit-keyword', code)
            await page.click('#search-button')

            // Open the first search result
            await page.waitForSelector(".result__headline")
            await page.click(".result__headline")

            // Wait for the class page to render, then pull out the data
            await page.waitForSelector("div.text:nth-child(2)")

            let data = await page.evaluate(() => {
                // Title-case the class heading, keeping "II" uppercase
                let classTitle = document.querySelector("div.text:nth-child(2)").textContent
                    .toLowerCase().split(' ')
                    .map((s) => s.charAt(0).toUpperCase() + s.substring(1))
                    .join(' ').replace('Ii', 'II')
                let classDesc = document.querySelector(".section--description > div:nth-child(2)").textContent
                    .replace('Lec/lab/rec.', '').trim()

                return {
                    title: classTitle,
                    desc: classDesc
                }
            })

            console.log(`== Finished grabbing ${code}`)

            return data
        }
    }

This runs perfectly fine on my local machine. However, when I push to Heroku, it only runs the first three class codes. I have a feeling this could be due to my dyno running out of memory, but I don't know how to make it wait until memory is available.

Here are the deploy logs:

    2023-05-22T17:29:18.421015+00:00 app[web.1]: == Finished grabbing CS 475
    2023-05-22T17:29:19.098698+00:00 app[web.1]: == Finished grabbing CS 331
    2023-05-22T17:29:19.783377+00:00 app[web.1]: == Finished grabbing CS 370
    2023-05-22T17:29:49.992190+00:00 app[web.1]: /app/node_modules/puppeteer/lib/cjs/puppeteer/common/util.js:317
    2023-05-22T17:29:49.992208+00:00 app[web.1]:     const timeoutError = new Errors_js_1.TimeoutError(`waiting for ${taskName} failed: timeout ${timeout}ms exceeded`);
    2023-05-22T17:29:49.992209+00:00 app[web.1]:                          ^
    2023-05-22T17:29:49.992209+00:00 app[web.1]:
    2023-05-22T17:29:49.992210+00:00 app[web.1]: TimeoutError: waiting for target failed: timeout 30000ms exceeded
    2023-05-22T17:29:49.992211+00:00 app[web.1]:     at waitWithTimeout (/app/node_modules/puppeteer/lib/cjs/puppeteer/common/util.js:317:26)
    2023-05-22T17:29:49.992230+00:00 app[web.1]:     at Browser.waitForTarget (/app/node_modules/puppeteer/lib/cjs/puppeteer/common/Browser.js:405:56)
    2023-05-22T17:29:49.992230+00:00 app[web.1]:     at ChromeLauncher.launch (/app/node_modules/puppeteer/lib/cjs/puppeteer/node/ChromeLauncher.js:100:31)
    2023-05-22T17:29:49.992230+00:00 app[web.1]:     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    2023-05-22T17:29:49.992231+00:00 app[web.1]:     at async Object.scrapeData (/app/crawler.js:9:21)
    2023-05-22T17:29:49.992231+00:00 app[web.1]:     at async getClassData (file:///app/server.mjs:40:16)

I read somewhere that I should try clearing the build cache using these commands:

$ heroku plugins:install heroku-builds
$ heroku builds:cache:purge --app your-app-name

I tried that and it didn't change anything. I also followed the Heroku troubleshooting notes on the Puppeteer GitHub.

The reason I believe it might be related to my dyno's memory is this related post. If that is the case, I would like to figure out how to wait until there is memory available to use.

EDIT: I am now running the browser in headless mode as well; this results in the exact same error.
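
Concretely, the launch call now passes the standard headless option (a sketch; the other flags are unchanged):

    const browser = await puppeteer.launch({
        headless: true,   // run Chrome without a visible window
        args: [
            '--no-sandbox',
            '--disable-setuid-sandbox'
        ]
    })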


Solution

  • Upon further logging, I discovered that I was leaking memory by launching a new browser on every call and never closing it. Adding the line await browser.close() right before the return statement of the scrapeData() function stopped the memory leaks, and the server was able to scrape all of the class codes correctly.
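
In code, the fix is a one-line addition at the end of scrapeData() (a sketch; the rest of the function is unchanged):

    console.log(`== Finished grabbing ${code}`)

    await browser.close()   // release this call's Chrome instance and its memory

    return data

A slightly more defensive variant is to wrap the scraping steps in try { ... } finally { await browser.close() }, so the browser is also closed when a navigation or selector wait throws, rather than leaking on errors.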