Tags: python, web-scraping, heroku

Error H14 on Heroku with Selenium and FastAPI using Python


I have a FastAPI app in Python that does some web scraping. The scraping part works correctly (I've verified this by testing), but I get this error when I visit the API page:

2022-07-08T09:15:12.564152+00:00 app[worker.1]: INFO: Started server process [4]
2022-07-08T09:15:12.564200+00:00 app[worker.1]: INFO: Waiting for application startup.
2022-07-08T09:15:12.564650+00:00 app[worker.1]: INFO: Application startup complete.
2022-07-08T09:15:12.565232+00:00 app[worker.1]: INFO: Uvicorn running on http://0.0.0.0:47436 (Press CTRL+C to quit)
2022-07-08T09:16:05.643153+00:00 heroku[router]: at=error code=H14 desc="No web processes running" method=GET path="/" host=cryptic-plateau-86689.herokuapp.com request_id=504c098c-a538-418b-898c-70ed38496780 fwd="156.146.59.25" dyno= connect= service= status=503 bytes= protocol=https

Here's a small snippet of my script:

dict = Scraping().get_books() # this is the web scraping part
app = FastAPI()

@app.get("/")
def home():
    """Gets everything"""
    return dict

And here's my Procfile:

worker: uvicorn main:app --host=0.0.0.0 --port=${PORT:-5000}

Note that I tried using web instead of worker, but then I get another error:

 Error R10 (Boot timeout) -> Web process failed to bind to $PORT within 60 seconds of launch 

Note that Scraping().get_books() takes a long time (2-5 minutes), which is why I think it causes a timeout when using web.

Please keep in mind that I'm a beginner, and here's how I think about it: worker can do the web scraping part but can't handle the API part. On the other hand, web can handle the API part but can't do the web scraping. Is this theory correct? If so, how can I use both web and worker at the same time for different tasks?


Solution

  • You've partly answered the question yourself: the reason you are getting the Heroku error is that you are not defining a web process, which you need in order to expose a web API. You've also given the reason you're getting the other error when you do use the web process: the first line (dict = Scraping().get_books()) takes 2-5 minutes to run, meaning the process is "stuck" waiting on that line, so the actual FastAPI app doesn't start until after those 2-5 minutes, and Heroku gives the web process only 60 seconds to bind to $PORT (hence the R10 boot timeout).

    Also, a side note: dict is the name of a built-in type in Python, so you really shouldn't shadow it with a variable name. Try to find a more descriptive name, e.g. book_dict.

    So what can be done to fix this? First, how often should the scraping run? Currently, you're running it once (when starting the application), and the result is then fixed until you restart. That seems a bit odd to me, since in that case you could run the scraping once, save the result to a JSON file, and then just read that file and return its contents. So, I'll assume you want to refresh the data at least sometimes. You could run Scraping().get_books() inside your home method, but it's usually considered bad practice to have HTTP requests that take longer than a couple of seconds, and such long requests will often lead to timeouts.

    So first, I would consider why the scraping job is taking so long. Are you going through a lot of pages, and if so, could you split the work into a function that takes a range of pages? Or see if you can get the data directly from the underlying API instead of using Selenium (see e.g. this video).
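    If the site does load its data from a JSON API under the hood, fetching it directly is usually far faster than driving a browser. A minimal sketch (the URL and response shape here are entirely hypothetical; you'd find the real endpoint in your browser's network tab):

    import requests

    def get_books_via_api():
        # Hypothetical endpoint; replace with whatever request the site
        # actually makes when it loads the book list.
        resp = requests.get("https://example.com/api/books", timeout=30)
        resp.raise_for_status()
        return resp.json()
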

    But if it's not possible to speed up the scraping, there are a couple of ways to deal with long-running transactions. There is no cut-and-dried solution, but I'll offer some suggestions:

    1. The simplest solution might be to run the web scraping in the same process, but in a background thread:

      import asyncio

      book_dict = {}

      async def refresh_books():
          global book_dict
          # Run the blocking scraping call in a background thread
          # (asyncio.to_thread requires Python 3.9+), so the event
          # loop stays free to serve requests in the meantime.
          book_dict = await asyncio.to_thread(Scraping().get_books)


      You could then run this, say, a couple of times a day using the fastapi-utils library's @repeat_every, as sketched below.
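
      A minimal sketch of wiring that up, assuming the fastapi-utils package is installed (the 12-hour interval is just an example):

      from fastapi import FastAPI
      from fastapi_utils.tasks import repeat_every

      app = FastAPI()

      @app.on_event("startup")
      @repeat_every(seconds=60 * 60 * 12)  # roughly twice a day
      async def refresh_books_periodically() -> None:
          await refresh_books()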

    2. Look at a task queue library, e.g. Celery. This is a more robust (but also more complex) way of running background tasks, where you'd have both a web and a worker process, with a message queue to communicate (I would suggest RabbitMQ on Heroku). Then you'd have one API method that starts the scraping task and immediately returns the task ID, and another method to fetch the task result using that ID, as sketched below. Here, too, you could schedule the job to run a couple of times per day if you wanted.
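
      A rough illustration of the Celery approach (the module names, the broker URL, and the endpoint names are all my assumptions, not something from your code):

      # tasks.py
      from celery import Celery

      celery_app = Celery("tasks", broker="amqp://localhost", backend="rpc://")

      @celery_app.task
      def scrape_books_task():
          # Runs in the worker process, so it can take as long as it needs.
          return Scraping().get_books()

      # main.py
      from celery.result import AsyncResult
      from fastapi import FastAPI, status

      from tasks import celery_app, scrape_books_task

      app = FastAPI()

      @app.post("/scrape", status_code=status.HTTP_202_ACCEPTED)
      def start_scrape():
          task = scrape_books_task.delay()
          return {"task_id": task.id}

      @app.get("/scrape/{task_id}")
      def get_scrape_result(task_id: str):
          result = AsyncResult(task_id, app=celery_app)
          if not result.ready():
              return {"status": result.status}
          return {"status": result.status, "books": result.get()}

      Your Procfile would then declare both process types, e.g. web: uvicorn main:app --host=0.0.0.0 --port=${PORT:-5000} and worker: celery -A tasks worker.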


    EDIT: You mentioned exposing a separate endpoint to scrape the data and save it to a JSON file. In this case, I would opt for FastAPI's built-in BackgroundTasks, where you create a background task that runs the scraping and then saves the JSON file. Something along the lines of:

    import json

    from fastapi import BackgroundTasks, FastAPI, status

    app = FastAPI()


    def scrape_books():
        # Runs after the response has been sent, so the client
        # doesn't have to wait the 2-5 minutes.
        book_dict = Scraping().get_books()
        with open("books.json", "w") as f:
            json.dump(book_dict, f)


    @app.post("/update-data", status_code=status.HTTP_202_ACCEPTED)
    def update_data(background_tasks: BackgroundTasks):
        background_tasks.add_task(scrape_books)
        return {"message": "Scraping books in the background"}