google-cloud-platform, google-cloud-datastore, app-engine-ndb

Google Datastore ancestor query returning data too far down


I have an "inbox/messaging" structure that I'm working on that allows for multiple kinds of parents. As in, people can leave comments on a few different kinds of objects. For this example, let's say someone is leaving a comment on an Article object.

The way we've formatted our data, the comment is created as a Message object, and that object is a child of Article (and Article is a child of Account). So when we query for the list of messages, we simply ask for all Messages that are children of that instance of Article. That looks like this:

Message.query(ancestor=source_key)

source_key here is the Key of the article we're viewing.

Great, this works really well and is pretty fast.

Now we want to add replies to those Message objects. I figure we'll just store replies the same way we add Messages to Articles. Which is to say, a Reply is simply another instance of Message, and the parent of that reply object is the message it's replying to. So basically, instead of leaving a comment on an Article, you're leaving a comment on a Message.

This sounds good on paper but it seems that in practice, the Key it ends up getting is structured like so:

Key('Account', 5629499534213120, 'Article', 5946158883012608, 'Message', 6509108836433920)

As a result, when we query for the list of messages, the replies come back in the response as well, as if they weren't replies at all.

Some questions:

  • Is there any way we can do like a "shallow" query? To strictly get only the immediate children of that Article?

  • I've read more on how ancestor queries work, and because they come with a limit of 1 write per second, I'm now wondering whether it would be better to change how we store this so that a Message is not the child of an Article and instead has a KeyProperty pointing at the Article, if that makes sense. Maybe Message would have no parent at all. There could be lots of people leaving comments on an article, or lots of people leaving replies to those comments. But even so, Article is a child of Account too, along with a lot of other kinds of objects, and generally we don't run into any issues with lots of different writes. So would we even run into this write limitation?

EDIT: I've moved on a little bit and am trying to query only replies for a given message, so I'm looking for all messages that have a parent (ancestor) of another message.

So given this key as the ancestor: Key('Account', 5629499534213120, 'Article', 5946158883012608, 'Message', 5663034638860288)

I query our message table, and I get back that exact same key (as well as other messages). How is that possible? If I'm specifying an ancestor, in what world does it make sense that I would get back the same object I'm using to query the ancestor with? The parent of that message is just:

Key('Account', 5629499534213120, 'Article', 5946158883012608)

So, obviously the ancestor doesn't strictly match there. Why would my query return it then? Hastebin of what, basically, I'm running into: https://hastebin.com/karojolisi.py
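
The behaviour described above can be reproduced with a small pure-Python model of key paths. This is only a sketch and does not call ndb: keys are modelled as tuples of (kind, id) pairs, and the helper names `ancestor_query`, `parent_of` and `direct_children` are hypothetical. It assumes, as the Datastore documentation describes, that an ancestor query matches the ancestor entity itself plus all of its descendants, which is to say every key whose path starts with the ancestor's path. The reply id 1111 is made up for illustration.

```python
# Pure-Python model of Datastore key paths; this does not call ndb.
# A key is a tuple of (kind, id) pairs, root first.
ACCOUNT = (("Account", 5629499534213120),)
ARTICLE = ACCOUNT + (("Article", 5946158883012608),)
MESSAGE = ARTICLE + (("Message", 5663034638860288),)
REPLY = MESSAGE + (("Message", 1111),)  # hypothetical id for a reply

ALL_KEYS = [ACCOUNT, ARTICLE, MESSAGE, REPLY]


def kind_of(key):
    """Kind of the entity a key points at (the last pair in its path)."""
    return key[-1][0]


def parent_of(key):
    """Key path with the last pair dropped, or None for a root key."""
    return key[:-1] or None


def ancestor_query(keys, kind, ancestor):
    """Keys of `kind` whose path starts with `ancestor` (ancestor included)."""
    n = len(ancestor)
    return [k for k in keys if kind_of(k) == kind and k[:n] == ancestor]


def direct_children(keys, kind, ancestor):
    """'Shallow' variant: keep only keys whose immediate parent is `ancestor`."""
    return [k for k in ancestor_query(keys, kind, ancestor)
            if parent_of(k) == ancestor]


by_article = ancestor_query(ALL_KEYS, "Message", ARTICLE)   # message and reply
by_message = ancestor_query(ALL_KEYS, "Message", MESSAGE)   # message itself and reply
top_level = direct_children(ALL_KEYS, "Message", ARTICLE)   # message only
replies = direct_children(ALL_KEYS, "Message", MESSAGE)     # reply only
```

In this model the Message used as the ancestor comes back from its own ancestor query because its path trivially starts with itself and its kind matches. With real ndb keys, a similar post-filter could compare `key.parent()` against the ancestor key, at the cost of reading and then discarding the deeper descendants.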


Solution

  • Regarding the question on the write limitation: if you are using Cloud Firestore in Datastore mode, the limit of 1 write per second applies per entity, not per entity group.

    See https://cloud.google.com/datastore/docs/firestore-or-datastore

    "Writes to an entity group are no longer limited to 1 per second."

    and https://cloud.google.com/datastore/docs/concepts/limits

    "Maximum write rate to an entity" is "1 per sec"

    So, whichever approach you take, writes shouldn't be a concern in Datastore mode, since messages and replies are not expected to be edited. The exception is if you keep any kind of aggregate information, such as the number of replies for a given message, which requires updating the parent message record each time a child record is written.

    Regarding your main question of querying only the messages for an article and not their replies: one option is to add a field called article_id, populate it only on the top-level messages, and include it in the index (as a prefix of the ancestor composite index). The reason to recommend article_id rather than a boolean is that the field is indexed, and it is better for an indexed field not to be limited to a narrow range of values.

    The reason to prefer this approach over storing the messages in a separate table is that, with the original parent/child layout, all messages belonging to an article are stored close together, which is better for read performance.
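
The article_id idea can be sketched with a small in-memory model. This is an illustration only, not ndb code: the `Message` dataclass, the `STORE` list, the `top_level_messages` helper, and the message ids 1 and 2 are all hypothetical. It assumes replies keep the message they answer as their parent and simply leave article_id unset.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Message:
    path: Tuple[Tuple[str, int], ...]   # modelled key path, root first
    body: str
    article_id: Optional[int] = None    # set only on top-level messages


ARTICLE_ID = 5946158883012608
ARTICLE = (("Account", 5629499534213120), ("Article", ARTICLE_ID))

# A top-level comment carries article_id; a reply to it does not.
comment = Message(ARTICLE + (("Message", 1),), "first comment",
                  article_id=ARTICLE_ID)
reply = Message(comment.path + (("Message", 2),), "a reply")

STORE = [comment, reply]


def top_level_messages(store, article_path, article_id):
    """Ancestor match on the article path plus an equality match on article_id."""
    n = len(article_path)
    return [m for m in store
            if m.path[:n] == article_path and m.article_id == article_id]


result = top_level_messages(STORE, ARTICLE, ARTICLE_ID)  # only the comment
```

In ndb itself, and assuming Message gains an indexed article_id property, the equivalent query would presumably look something like `Message.query(Message.article_id == article_id, ancestor=article_key)`. Replies stay in the same entity group as the article, so an unfiltered ancestor query still returns the whole thread.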