Tags: google-app-engine, google-cloud-datastore, app-engine-ndb, task-queue

ndb data contention getting worse and worse


I have a bit of a strange problem. I have a module running on GAE that puts a whole lot of little tasks on the default task queue. The tasks all access the same ndb models. Each task reads a bunch of data from a few different tables and then calls put.

The first few tasks work fine, but as time goes on I start getting these errors on the final put:

suspended generator _put_tasklet(context.py:358) raised TransactionFailedError(too much contention on these datastore entities. please try again.)

So I wrapped the put in a try with a randomised sleep so that it retries a couple of times. This mitigated the problem a little; it just happens later on.

Here is some pseudocode for my task:

import logging
import random
import time

from google.appengine.api.datastore_errors import TransactionFailedError

logger = logging.getLogger(__name__)

def my_task(request):
    stuff = get_ndb_instances()  # this accesses a few things from different tables
    better_stuff = process(stuff)  # pretty much just a summation
    try_put(better_stuff)
    return {'status': 'Groovy'}

def try_put(oInstance, iCountdown=10):
    if iCountdown < 1:
        # out of retries: do the final put and let any error propagate
        return oInstance.put()
    try:
        return oInstance.put()
    except TransactionFailedError:
        logger.info("sleeping")
        time.sleep(random.random() * 20)  # sleep up to 20 seconds, then retry
        return try_put(oInstance, iCountdown - 1)

Without using try_put the queue gets about 30% of the way through before it stops working. With try_put it gets further, to around 60%.

Could it be that a task is holding onto ndb connections after it has completed somehow? I'm not making explicit use of transactions.

EDIT:

There seems to be some confusion about what I'm asking. The question is: why does ndb contention get worse as time goes on? I have a whole lot of tasks running simultaneously, and they access ndb in a way that can cause contention. If contention is detected then a randomly timed retry happens, and this eliminates the contention perfectly well, for a little while. Tasks keep running and completing, and the more of them that successfully return, the more contention happens, even though the processes using the contended-upon data should be finished. Is there something going on that's holding onto datastore handles that shouldn't be? What's going on?

EDIT2:

Here is a little bit about the key structures in play:

My ndb models sit in a hierarchy that looks something like this (the direction of the arrows indicates parent-child relationships, i.e. a Type has a bunch of child Instances, etc.):

Type->Instance->Position

The ids of the Positions are limited to a few different names; there are many thousands of Instances and not many Types.
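For illustration, the hierarchy above might be declared in ndb roughly like this (the property names are placeholders I've made up; the parent-child arrows are expressed through the keys):

from google.appengine.ext import ndb

class Type(ndb.Model):
    name = ndb.StringProperty()

class Instance(ndb.Model):
    created = ndb.DateTimeProperty(auto_now_add=True)

class Position(ndb.Model):
    value = ndb.FloatProperty()

# Because of the ancestor keys, every Instance and Position under a Type
# belongs to that Type's entity group.
type_key = ndb.Key(Type, 'some-type')
instance_key = ndb.Key(Instance, 12345, parent=type_key)
position = Position(id='open', parent=instance_key, value=1.0)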

I calculate a bunch of Positions and then do a try_put_multi (similar to try_put in an obvious way) and get contention. I'm going to run the code again pretty soon and get a full traceback to include here.


Solution

  • Contention will get worse over time if you continually exceed the limit of 1 write/transaction per entity group per second. The answer lies in how Megastore/Paxos works and how Cloud Datastore handles contention in the backend.

    When 2 writes are attempted at the same time on different nodes in Megastore, one transaction will win and the other will fail. Cloud Datastore detects this contention and will retry the failed transaction several times. Usually this results in the transaction succeeding without any errors being raised to the client.

    If sustained writes above the recommended limit are being attempted, the chance that a transaction needs to be retried multiple times increases. The number of transactions in an internal retry state also increases. Eventually, transactions will start reaching our internal retry limit and will return a contention error to the client.

    A randomized sleep is not the right way to handle these error responses. You should instead look into exponential back-off with jitter (see the sketch after this list).

    Similarly, the core of your problem is a high write rate into a single entity group. You should look into whether the explicit parenting is required (removing it if not), or whether you should shard the entity group in some manner that makes sense according to your queries and consistency requirements (one option is sketched below).
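As an illustration of back-off with jitter, here is a minimal sketch; the helper name, attempt count and delays are arbitrary choices, not part of any library:

import random
import time

from google.appengine.api.datastore_errors import TransactionFailedError

def put_with_backoff(entity, max_attempts=5, base_delay=0.1, max_delay=10.0):
    """Retry entity.put() with exponential back-off plus full jitter."""
    for attempt in range(max_attempts):
        try:
            return entity.put()
        except TransactionFailedError:
            if attempt == max_attempts - 1:
                raise  # out of retries, surface the error to the caller
            # The cap doubles on every failure; the actual sleep is a
            # random amount anywhere below the cap ("full jitter").
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))

Compared with the fixed random sleep in try_put, the delay starts small and only grows while contention persists, so retries spread out without stalling every task for up to 20 seconds.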
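And as a sketch of the second suggestion, this is what dropping the ancestor relationship in favour of a plain reference could look like (hypothetical names again; the trade-off is that lookups by instance become eventually consistent, non-ancestor queries):

from google.appengine.ext import ndb

class Position(ndb.Model):
    # A KeyProperty reference instead of a parent key: each Position is
    # then the root of its own entity group, so concurrent puts no longer
    # contend on a shared ancestor.
    instance = ndb.KeyProperty(kind='Instance')
    value = ndb.FloatProperty()

# The id must now be unique on its own, e.g. by folding in the instance id.
instance_key = ndb.Key('Type', 'some-type', 'Instance', 12345)
Position(id='open-12345', instance=instance_key, value=1.0).put()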