
FastAPI background task takes up to 100 times longer to execute than calling the function directly


I have a simple FastAPI endpoint deployed on Google Cloud Run. I wrote the Workflow class myself. When a Workflow instance is executed, several steps happen, e.g., files are processed and the results are put in a vector store database.

Usually this takes a few seconds per file. The endpoint looks like this:

from fastapi import Request
from fastapi.responses import JSONResponse

from .workflow import Workflow
...

@app.post('/execute_workflow_directly')
async def execute_workflow_directly(request: Request):
    ...  # get files from request object
    workflow = Workflow.get_simple_workflow(files=files)
    workflow.execute()
    return JSONResponse(status_code=200, content={'message': 'Successfully processed files'})

Now, if many files are involved, this might take a while, and I don't want to make the caller of the endpoint wait, so I want to run the workflow execution in the background like this:

from fastapi import BackgroundTasks, Request
from fastapi.responses import JSONResponse

from .workflow import Workflow
...

def run_workflow_in_background(workflow: Workflow):
    workflow.execute()

@app.post('/execute_workflow_in_background')
async def execute_workflow_in_background(request: Request, background_tasks: BackgroundTasks):
    ...  # get files from request object
    workflow = Workflow.get_simple_workflow(files=files)
    background_tasks.add_task(run_workflow_in_background, workflow)
    return JSONResponse(status_code=202, content={'message': 'File processing started'})

Testing this with still only one file, I already run into a problem: locally it works fine, but when I deploy it to my Google Cloud Run service, execution time goes through the roof. In one example, background execution took ~500s until I saw the result in the database, compared to ~5s when executing the workflow directly.

I already tried increasing the number of CPU cores to 4 and, subsequently, the number of gunicorn workers to 4 as well. I'm not sure that makes much sense, but it did not decrease the execution times.

Can I solve this problem by allocating more resources to Cloud Run somehow? Or is my approach flawed and I'm doing something wrong, or should I already switch to something more sophisticated like Celery?


Edit (not really relevant to the problem I had, see accepted answer):

I read the accepted answer to this question and it helped clarify some things, but it doesn't really answer my question of why there is such a big difference in execution time between running directly vs. as a background task. Both versions call the CPU-intensive workflow.execute() asynchronously, if I'm not mistaken.

I can't really change the endpoint's definition from async def to def, because I am awaiting other code inside it.

I tried changing the background function to

from fastapi.concurrency import run_in_threadpool

async def run_workflow_in_background(workflow: Workflow):
    await run_in_threadpool(workflow.execute)

and

import asyncio
import concurrent.futures

async def run_workflow_in_background(workflow: Workflow):
    loop = asyncio.get_running_loop()
    with concurrent.futures.ThreadPoolExecutor() as pool:
        res = await loop.run_in_executor(pool, workflow.execute)

and

async def run_workflow_in_background(workflow: Workflow):
    res = await asyncio.to_thread(workflow.execute)

and

async def run_workflow_in_background(workflow: Workflow):
    loop = asyncio.get_running_loop()
    with concurrent.futures.ProcessPoolExecutor() as pool:
        res = await loop.run_in_executor(pool, workflow.execute)

as suggested and it didn't help.

I tried increasing the number of workers as suggested and it didn't help.

I guess I will look into switching to Celery, but I'm still eager to understand why it is so slow with FastAPI background tasks.
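For reference, a minimal sketch of what a Celery version could look like. The broker URL and the task body here are placeholders, not taken from the original code, and task_always_eager is enabled only so the sketch runs without a real broker; in production a worker process would run outside the request/response cycle, so CPU throttling of the web container no longer affects the heavy work:

from celery import Celery

# Assumed broker URL; in production this would point at a Redis/RabbitMQ
# instance reachable from both the web service and the worker.
celery_app = Celery('workflows', broker='redis://localhost:6379/0')

# For local experimentation only: run tasks synchronously, no broker needed.
celery_app.conf.task_always_eager = True

@celery_app.task
def execute_workflow_task(file_paths):
    # Placeholder body; in the real app this would build and execute
    # the Workflow from the question, e.g.:
    #   workflow = Workflow.get_simple_workflow(files=file_paths)
    #   workflow.execute()
    return f'processed {len(file_paths)} files'

In the endpoint, background_tasks.add_task(...) would then be replaced by execute_workflow_task.delay(file_paths). Note that the task should receive plain, serializable data (file paths) rather than the Workflow object itself, since task arguments are serialized on their way to the worker.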


Solution

  • With Cloud Functions, as with Cloud Run, the CPU is allocated (and billed) only while a request is being processed.

    A request is considered "being processed" between the reception of the request and the sending of the response.

    The rest of the time, the CPU is throttled (to below 5%).


    That being said, look back at your functions.

    • The fastest one gets the data, processes the data, and sends the response. The CPU is allocated the whole time during the processing.
    • The slowest one gets the data, starts a task in the background (multi-threading, forking, or whatever), and sends the response immediately. After the response is sent, the CPU is throttled, and only then does the processing begin. Of course it is very slow: you are outside the CPU allocation boundaries.
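This ordering can be observed in-process: with FastAPI's BackgroundTasks, the task body only starts after the endpoint has produced its response, which is exactly the point at which Cloud Run throttles the CPU. A small self-contained check (the /bg route and slow_task are illustrative stand-ins, not the original code; TestClient runs the app's background tasks before returning to the caller):

import time

from fastapi import BackgroundTasks, FastAPI
from fastapi.testclient import TestClient

events = []  # records the order in which things happen

def slow_task():
    events.append('task_started')
    time.sleep(0.1)  # stand-in for workflow.execute()
    events.append('task_finished')

app = FastAPI()

@app.post('/bg')
async def bg(background_tasks: BackgroundTasks):
    background_tasks.add_task(slow_task)
    events.append('response_returned')
    return {'message': 'File processing started'}

client = TestClient(app)
client.post('/bg')

# The endpoint returns first; the background task runs afterwards --
# and on Cloud Run, that "afterwards" happens on a throttled CPU.
print(events)  # ['response_returned', 'task_started', 'task_finished']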

    To solve that, you can use Cloud Run with the option "CPU always allocated" (--no-cpu-throttling on the gcloud command line). There is no such option for Cloud Functions.
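    For example, with the gcloud CLI (the service name and region here are placeholders):

    # Keep the CPU allocated even after the response is sent,
    # so background tasks are not throttled.
    gcloud run services update my-workflow-service \
      --region=europe-west1 \
      --no-cpu-throttling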