For 100k+ entities in Google Datastore, ndb.query().count() is going to be cancelled by the deadline, even with an index. I've tried the produce_cursors option, but only iter() and fetch_page() return a cursor; count() doesn't.
How can I count such a large number of entities?
To do something that expensive you should take a look at the Task Queue Python API. Based on the Task Queue API, Google App Engine provides the deferred library, which you can use to simplify the whole process of running background tasks.
Here is an example of how you could use the deferred library in your app:
import logging

def count_large_query(query):
    total = query.count()
    logging.info('Total entities: %d' % total)
Then you can call the above function from within your app like:
from google.appengine.ext import deferred

# Somewhere in your request (MyModel is a placeholder for your own model):
deferred.defer(count_large_query, MyModel.query())
While I'm still not sure whether count() is going to return any results at all with such a large datastore, you could use this count_large_query() function instead, which uses cursors (untested):
LIMIT = 1024

def count_large_query(query):
    cursor = None
    more = True
    total = 0
    while more:
        # keys_only fetches only the keys, which is cheaper than full entities.
        keys, cursor, more = query.fetch_page(LIMIT,
                                              start_cursor=cursor,
                                              keys_only=True)
        total += len(keys)
    logging.info('Total entities: %d' % total)
To try the above locally, set LIMIT to 4 and check in your console that the Total entities: ## line appears.
As Guido mentioned in the comments, this is not going to scale either:
This still doesn't scale (though it may postpone the problem). A task has a 10 minute limit instead of 1 minute, so maybe you can count 10x as many entities. But it's pretty expensive! Have a search for sharded counters if you want to solve this properly (unfortunately it's a lot of work).
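If a single deferred task still can't finish before its deadline, one way to postpone the problem is to have the task re-defer itself with the last cursor, so each task only processes one batch. This is only a sketch under assumptions: Article is a placeholder model, and the batch size and function names are illustrative:

import logging

from google.appengine.ext import deferred
from google.appengine.ext import ndb

BATCH = 1000

class Article(ndb.Model):  # placeholder; use your own model
    pass

def count_in_batches(cursor_urlsafe=None, total=0):
    cursor = ndb.Cursor(urlsafe=cursor_urlsafe) if cursor_urlsafe else None
    keys, cursor, more = Article.query().fetch_page(BATCH,
                                                    start_cursor=cursor,
                                                    keys_only=True)
    total += len(keys)
    if more and cursor:
        # Hand the remaining work to a fresh task so no single task runs
        # into its deadline; only the cursor string and the running total
        # are passed along.
        deferred.defer(count_in_batches, cursor.urlsafe(), total)
    else:
        logging.info('Total entities: %d' % total)

You would kick it off with deferred.defer(count_in_batches) and let the chain run. Note that this only stretches the deadline; every batch still costs datastore reads.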
So you might want to take a look at the best practices for writing scalable applications, and especially at sharding counters.
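The idea behind sharding counters is to keep the count up to date as entities are written, spread across several shard entities to avoid write contention, instead of counting everything after the fact. Here is a minimal sketch loosely following the App Engine sharding-counters article; CounterShard, NUM_SHARDS, increment() and get_count() are illustrative names, not part of any library:

import random

from google.appengine.ext import ndb

NUM_SHARDS = 20

class CounterShard(ndb.Model):
    """One shard of a named counter; the real total is the sum of all shards."""
    name = ndb.StringProperty(required=True)
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment(name):
    """Bump one randomly chosen shard, spreading write contention."""
    shard_id = '%s-%d' % (name, random.randint(0, NUM_SHARDS - 1))
    shard = CounterShard.get_by_id(shard_id)
    if shard is None:
        shard = CounterShard(id=shard_id, name=name)
    shard.count += 1
    shard.put()

def get_count(name):
    """Sum all shards; much cheaper than counting the entities themselves."""
    return sum(s.count for s in CounterShard.query(CounterShard.name == name))

You would call increment('article') wherever you put() a new entity and get_count('article') whenever you need the total.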