For 100k+ entities in Google Datastore, ndb.query().count() is going to be cancelled by the deadline, even with an index. I've tried the produce_cursors option, but only iter() and fetch_page() return a cursor; count() doesn't.
How can I count such a large number of entities?
To do something that expensive you should take a look at the Task Queue Python API. Based on the Task Queue API, Google App Engine provides the deferred library, which you can use to simplify the whole process of running background tasks.
Here is an example of how you could use the deferred library in your app:
import logging

def count_large_query(query):
    total = query.count()
    logging.info('Total entities: %d' % total)
Then you can call the above function from within your app like:
from google.appengine.ext import deferred

# Somewhere in your request (MyModel is a placeholder for your own model):
deferred.defer(count_large_query, MyModel.query())
While I'm still not sure whether count() is going to return any results at all with such a large datastore, you could use this count_large_query() function instead, which uses cursors (untested):
LIMIT = 1024

def count_large_query(query):
    cursor = None
    more = True
    total = 0
    while more:
        # keys_only fetches only the keys, which is cheaper than full entities.
        keys, cursor, more = query.fetch_page(LIMIT,
                                              start_cursor=cursor,
                                              keys_only=True)
        total += len(keys)
    logging.info('Total entities: %d' % total)
To try the above locally, set LIMIT to 4 and check in your console that the Total entities: ## line appears.
As Guido mentioned in the comments, this is not going to scale either:
This still doesn't scale (though it may postpone the problem). A task has a 10 minute limit instead of 1 minute, so maybe you can count 10x as many entities. But it's pretty expensive! Have a search for sharded counters if you want to solve this properly (unfortunately it's a lot of work).
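If a single deferred task still can't finish before its deadline, one way to postpone the problem is to have the task re-defer itself with the last cursor, so each task only processes one batch. This is only a sketch under assumptions: Article is a placeholder model, and the batch size and function names are illustrative:

import logging

from google.appengine.ext import deferred
from google.appengine.ext import ndb

BATCH = 1000

class Article(ndb.Model):  # placeholder; use your own model
    pass

def count_in_batches(cursor_urlsafe=None, total=0):
    cursor = ndb.Cursor(urlsafe=cursor_urlsafe) if cursor_urlsafe else None
    keys, cursor, more = Article.query().fetch_page(BATCH,
                                                    start_cursor=cursor,
                                                    keys_only=True)
    total += len(keys)
    if more and cursor:
        # Hand the remaining work to a fresh task so no single task runs
        # into its deadline; only the cursor string and the running total
        # are passed along.
        deferred.defer(count_in_batches, cursor.urlsafe(), total)
    else:
        logging.info('Total entities: %d' % total)

You would kick it off with deferred.defer(count_in_batches) and let the chain run. Note that this only stretches the deadline; every batch still costs datastore reads.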
So you might want to take a look at the best practices for writing scalable applications, and especially at sharding counters.
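The idea behind sharding counters is to keep the count up to date as entities are written, spread across several shard entities to avoid write contention, instead of counting everything after the fact. Here is a minimal sketch loosely following the App Engine sharding-counters article; CounterShard, NUM_SHARDS, increment() and get_count() are illustrative names, not part of any library:

import random

from google.appengine.ext import ndb

NUM_SHARDS = 20

class CounterShard(ndb.Model):
    """One shard of a named counter; the real total is the sum of all shards."""
    name = ndb.StringProperty(required=True)
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment(name):
    """Bump one randomly chosen shard, spreading write contention."""
    shard_id = '%s-%d' % (name, random.randint(0, NUM_SHARDS - 1))
    shard = CounterShard.get_by_id(shard_id)
    if shard is None:
        shard = CounterShard(id=shard_id, name=name)
    shard.count += 1
    shard.put()

def get_count(name):
    """Sum all shards; much cheaper than counting the entities themselves."""
    return sum(s.count for s in CounterShard.query(CounterShard.name == name))

You would call increment('article') wherever you put() a new entity and get_count('article') whenever you need the total.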