google-cloud-platform, google-cloud-datastore

GCP datastore: How to model data?


I am confused by Datastore. In particular, I have trouble deciding where to store information about my objects.

E.g. I have a car, which is owned by a company. In JSON this might look like this:

{
"car_id": "car001", # only unique among a particular owner
"company": "company001",
"value": 5200, # dollars
"Type": "Truck"
}

company and Type are each restricted to a few dozen values. I will frequently query by ID, company and Type. The data is essentially hierarchical: a company has cars of multiple types, and each type contains multiple actual cars.

I can see at least three ways to model it:

  1. Encode it in the Identifier:
    # company, car_type and car_id hold the same values stored as properties below
    key = client.key("Car", f"{company}_{car_type}_{car_id}")
    entity = datastore.Entity(key=key)
    entity.update({
        "car_id": "car001",  # only unique among a particular owner
        "company": "company001",
        "value": 5200,  # dollars
        "Type": "Truck",
    })
  1. Encode it in the Parent-Keys:
    company_key = client.key("Company", "company001")
    type_key = client.key("Type", "Truck", parent=company_key)
    key = client.key("Car", car_id, parent=type_key)
    entity = datastore.Entity(key=key)
    entity.update({
        "car_id": "car001",  # only unique among a particular owner
        "company": "company001",
        "value": 5200,  # dollars
        "Type": "Truck",
    })
  1. Query it:
    key = client.key("Car")  # identifier is automatically assigned, kind should be Car
    entity = datastore.Entity(key=key)
    entity.update({
        "car_id": "car001",  # only unique among a particular owner
        "company": "company001",
        "value": 5200,  # dollars
        "Type": "Truck",
    })

and then query by these properties in the application.

But which is best? For other NoSQL databases I know (RavenDB, Cassandra etc.), there is usually some sort of guide on how they are expected to be used, but I was unable to find such a thing for Datastore.


Solution

  • Datastore automatically indexes each property, so you can efficiently query by car_id, company and Type in all three of your proposed layouts.

    However, there are some other reasons you may want to pick one of the solutions:

    • If you want to store per-company information, such as its address, you should create a Company entity.
    • If you want to be able to retrieve and update Company and Cars in the same transaction with strong consistency, they must have a parent/child relationship.
    • There is a limit of about one write per second for each entity group. So if you want to be able to update cars from the same company more than once per second, you should not use a parent/child relationship.
    • The best practice is to avoid lots of reads and writes to a narrow range of keys. For this reason, you may prefer a randomly assigned ID rather than relying on something in your dataset that might result in skewed access patterns.
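
To make the trade-offs above a bit more concrete, here is a rough sketch of how the key layouts and the property query could look with the Python client. It assumes `client` is a `google.cloud.datastore.Client()`; the helper names `car_key` and `find_cars` are made up for illustration and are not part of the library.

```python
def car_key(client, company, car_type, car_id, use_ancestors=False):
    """Build a Car key for either layout from the question.

    Hypothetical helper. `client` is assumed to be a
    google.cloud.datastore.Client instance.
    """
    if use_ancestors:
        # Parent/child layout: Company -> Type -> Car share one entity group,
        # which allows transactional updates but limits write throughput.
        company_key = client.key("Company", company)
        type_key = client.key("Type", car_type, parent=company_key)
        return client.key("Car", car_id, parent=type_key)
    # Flat layout: every Car is its own entity group.
    return client.key("Car", f"{company}_{car_type}_{car_id}")


def find_cars(client, company, car_type):
    """Look up cars by indexed properties, independent of the key layout.

    Hypothetical helper. Uses the positional form of Query.add_filter;
    recent client releases also accept filter=PropertyFilter(...).
    """
    query = client.query(kind="Car")
    query.add_filter("company", "=", company)
    query.add_filter("Type", "=", car_type)
    return list(query.fetch())
```

With a real client you would call something like `find_cars(client, "company001", "Truck")`. Because the lookup only touches properties, it should behave the same whichever key layout you choose, which is why the remaining considerations (transactions, write rate, key distribution) tend to decide the matter.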