Tags: python, heroku, scrapy

Scrapy on Heroku


I've just upgraded my working Scrapy app hosted on Heroku to Build Pack 20. I'm now getting an error in my logs before my scraping application finishes.

Logs:

2021-08-25T14:15:49.867725+00:00 app[api]: Starting process with command `scrapy crawl main` by user [email protected]
2021-08-25T14:15:57.812969+00:00 heroku[run.7197]: State changed from starting to up
2021-08-25T14:15:57.758336+00:00 heroku[run.7197]: Awaiting client
2021-08-25T14:15:57.776747+00:00 heroku[run.7197]: Starting process with command `scrapy crawl main`
2021-08-25T14:37:11.126653+00:00 heroku[run.7197]: Client connection closed. Sending SIGHUP to all processes
2021-08-25T14:37:11.650022+00:00 heroku[run.7197]: Process exited with status 129
2021-08-25T14:37:11.850624+00:00 heroku[run.7197]: State changed from up to complete

I believe my problem relates to Heroku's limits on attached one-off dynos, which issue a timeout reset: https://devcenter.heroku.com/articles/limits#dynos. I'm not sure whether this resets the dyno or just the shell terminal.

Do I need to change something in my code to refresh the timeout counter using a "keep-alive" strategy?

Edit: From the Heroku shell, I did see that the spider was working perfectly for about an hour (a few hundred items scraped), and then the shell session ended without any notice or error message. So I assume this was the "SIGHUP" interruption sent by the dyno?
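For what it's worth, the `Process exited with status 129` line in the logs is consistent with death by SIGHUP: on POSIX systems a process killed by signal N is reported as exit status 128 + N, and SIGHUP is signal 1. A small standalone Python sketch (not part of the app) demonstrates the arithmetic:

```python
import signal
import subprocess
import sys
import time

# Spawn a child that just sleeps, standing in for a long-running crawl,
# then deliver SIGHUP the way the Heroku run console does on timeout.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
time.sleep(0.5)                   # give the child a moment to start
child.send_signal(signal.SIGHUP)  # simulate the terminal disconnect
child.wait()

# Python encodes "killed by signal N" as returncode -N; the shell
# (and Heroku's logs) report the same death as 128 + N.
print(child.returncode)           # -1: terminated by SIGHUP (signal 1)
print(128 - child.returncode)     # 129, the status seen in the logs above
```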


Solution

  • I solved my problem. I'm passing this on in case anyone else runs into it, as it was clearly a "rookie" mistake.

    I was trying to run my app from the web console, which lets you run a bash command from the browser. This is "Run console" from the "More" dropdown in the upper right corner.

    Apparently the SIGHUP is a signal sent by the "run console" shell, which times out after an hour. My app exited with code 129 rather than the expected exit code 0.

    If I run the app from the CLI using:

    heroku run [my start command]
    

    It runs all the way to completion and I get full logs and stdout from the CLI.
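    If the job does have to survive a terminal that may hang up, one alternative (not what I did above, just the general Unix technique behind `nohup`) is to have the process ignore SIGHUP so the disconnect doesn't kill it. A minimal standalone sketch, with a sleep standing in for the crawl:

```python
import signal
import subprocess
import sys
import textwrap
import time

# The child installs a handler that ignores SIGHUP, then "works".
# The sleep is a stand-in for a long-running Scrapy crawl.
child_code = textwrap.dedent("""
    import signal, time
    signal.signal(signal.SIGHUP, signal.SIG_IGN)  # survive terminal hangup
    time.sleep(2)
    print("finished")
""")

child = subprocess.Popen([sys.executable, "-c", child_code],
                         stdout=subprocess.PIPE, text=True)
time.sleep(0.5)                   # let the child install its handler
child.send_signal(signal.SIGHUP)  # simulate the console timing out
out, _ = child.communicate()

print(child.returncode)  # 0 — the child ran to completion despite SIGHUP
print(out.strip())       # finished
```

    Heroku also documents `heroku run:detached`, which runs the one-off dyno with no attached terminal at all, so there is no session to time out.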