I am running an application on gae flexible with python and flask. I periodically dispatch cloud tasks with a cron job. These basically loop through all users and perform some cluster analysis. The tasks terminate without throwing any kind of error but don't perform all the work (meaning not all users were looped through). It doesn't seem to happen at a consistent time 276.5s - 323.3s nor does it ever stop at the same user. Has anybody experienced anything similar?
My guess is that I am breaching some type of resource limit or timeout somewhere. Things i have thought about or tried:
Cloud tasks should be allowed to run for up to an hour (as per this: https://cloud.google.com/tasks/docs/creating-appengine-handlers)
I increased the timeout of gunicorn workers to be 3600 to reflect this.
I have several workers running.
I tried to find if there are memory spikes or cpu overload but didn't see anything suspicious.
Sorry if I am too vague or am completely missing the point, I am quite confused with this problem. Thank you for any pointers.
Thank you for all the suggestions, I played around with them and have found out the root cause, although by accident reading firestore documentation. I had no indication that this had anything to do with firestore.
From here: https://googleapis.dev/python/firestore/latest/collection.html I found out that Query.stream() (or Query.get()) has a timeout on the individual documents like so:
Note: The underlying stream of responses will time out after the max_rpc_timeout_millis value set in the GAPIC client configuration for the RunQuery API. Snapshots not consumed from the iterator before that point will be lost.
So what eventually timed out was the query of all users, I came across this by chance, none of the errors I caught pointed me back towards the query. Hope this helps someone in the future!