Search code examples
pythongoogle-cloud-datastoreapp-engine-ndb

Google Cloud Datastore Indexes for count queries


Google cloud datastore mandates that there needs to be composite indexes built to query on multiple fields of one kind. Taking the following query for example,

class Greeting(ndb.Model):
    user = ndb.StringProperty()
    place = ndb.StringProperty()

# Query 1
Greeting.query(Greeting.user == '[email protected]', Greeting.place == 'London').fetch()
# Query 2
Greeting.query(Greeting.user == '[email protected]', Greeting.place == 'London').count()

I am using python with ndb to access cloud datastore. In the above example, Query 1 raises NeedIndexError if there is no composite index defined on user and place. But Query 2 works fine even if there is no index on user and place.

I would like to understand how cloud datastore fetches the count (Query 2) without the index when it mandates the index for fetching the list of entities (Query 1). I understand it stores Stats per kind per index which would result in quicker response for counts on existing indexes (Refer docs). But I'm unable to explain the above behaviour.

Note: There is no issue when querying on one property of a given kind as cloud datastore has indexes on a single properties by default.


Solution

  • There is no clear & direct explanation on why this happens but most likely its because how improved query planner works with zigzag indexes.

    You can read more about this here: https://cloud.google.com/appengine/articles/indexselection#Improved_Query_Planner

    The logic behind count() working and fetch() does not probably because with count() you don't need to keep in memory a lot of results.

    So in case of count() you can easily scale by splitting work in multiple chunks processed in parallel and then just sum corresponding counts into one. You can't do this cheaply with cursors/recordsets.