Tags: database-design, primary-key, google-cloud-spanner

Optimizing Cloud Spanner for Read Latency


I designed a table schema for my Spanner database that takes advantage of interleaved tables to store associated records. I did this to optimize read times. My schema looks something like:

CREATE TABLE all_ids (
  userid INT64 NOT NULL,
) PRIMARY KEY(userid);

CREATE TABLE connections_in (
  userid INT64 NOT NULL,
  child_userid INT64 NOT NULL,
  connect_time TIMESTAMP NOT NULL,
) PRIMARY KEY(userid, child_userid),
  INTERLEAVE IN PARENT all_ids ON DELETE CASCADE;

CREATE TABLE connections_out (
  userid INT64 NOT NULL,
  child_userid INT64 NOT NULL,
  connect_time TIMESTAMP NOT NULL,
) PRIMARY KEY(userid, child_userid),
  INTERLEAVE IN PARENT all_ids ON DELETE CASCADE;

The connections_in and connections_out tables store graph connections between nodes in a user database. Each user may have at most several hundred connections.

I also want to store each user's metadata history over time, and I'm not sure whether it makes sense to interleave that data under the parent table all_ids. Each user accumulates several thousand metadata rows over time:

CREATE TABLE user_metadata (
  userid INT64 NOT NULL,
  snapshot_time TIMESTAMP NOT NULL,
  username STRING(1024) NOT NULL,
  user_description STRING(1024) NOT NULL,
) PRIMARY KEY(userid, snapshot_time DESC),
  INTERLEAVE IN PARENT all_ids ON DELETE CASCADE;

I know that Spanner orders primary keys lexicographically. Does that mean a row from the user_metadata table could be physically collocated next to a row from connections_in or connections_out, since the first part of every primary key comes from the all_ids table? And if so, would reads from connections_in/connections_out by userid be slower than if I instead created a separate, non-interleaved user_metadata table in the same database, like this:

CREATE TABLE user_metadata (
  userid INT64 NOT NULL,
  snapshot_time TIMESTAMP NOT NULL,
  username STRING(1024) NOT NULL,
  user_description STRING(1024) NOT NULL,
) PRIMARY KEY(userid, snapshot_time DESC);

Any help is greatly appreciated!


Solution

  • Spanner uses interleaving to collocate data within the same split, that is, to assign related data to the same compute node for processing and join purposes. It's not a row-by-row mechanism: inserting a user_metadata record won't push the all_ids record a disk sector further away from the connections_out record, or anything like that.

    At the level of splits, there can be a performance impact, but the answer is "it depends." There are two scenarios:

    1. userid is high-cardinality

    Let's say you have 100 Spanner nodes and 1,000,000 distinct userids (with per-user load at least roughly uniform). In this case, Spanner generally won't need to look beyond the first key component to distribute the data reasonably uniformly around the cluster, and will tend to keep rows clustered by just that first key (userid) for performance and simplicity. So all data for each user will be stored together, even if you have a bunch of interleaved tables.

    2. userid is low-cardinality

    Let's say you have 100 Spanner nodes and 8 users, but those users each have lots and lots of connections_in, connections_out, and/or user_metadata records. In this case, Spanner will want to split up the data for each user across multiple Spanner nodes so that it can take full advantage of the 100-node cluster. Joining all of the data for a single user back together would then require data from several Spanner nodes -- that's a distributed JOIN, which is (generally speaking) more expensive than having all the data local to a single node.

    If you have a large number of users but one of them receives dramatically more requests/sec across its child records than the others, Spanner may decide to split up the records for just that one extra-large userid. It does this to avoid overloading the single Spanner node handling that user's workload and causing an even larger latency spike. This isn't a common case, but it can happen under serious data and load skew.
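To make the collocation point concrete, here is a small Python sketch of how rows from all three interleaved tables cluster by their shared userid prefix under lexicographic key ordering. This is purely illustrative: the ordering of child tables within a user's cluster is an assumption for the example, not Spanner's actual storage format. It does show why a user_metadata row can sit near a connections_in row for the same user, inside that user's cluster:

```python
# Illustrative only: model each row as (table, userid, child_key) and
# sort the combined set lexicographically by the shared key prefix.
# Table names match the schema above; the physical layout shown here
# is an approximation, not Spanner's real on-disk format.
rows = [
    ("connections_out", 2, "7"),
    ("all_ids", 1, ""),
    ("user_metadata", 2, "2023-01-01"),
    ("connections_in", 1, "42"),
    ("all_ids", 2, ""),
    ("user_metadata", 1, "2023-06-01"),
]

# Sort by userid first, then table, then the child key: every row for
# a given user ends up contiguous, regardless of which child table it
# belongs to.
ordered = sorted(rows, key=lambda r: (r[1], r[0], r[2]))
print([r[1] for r in ordered])  # -> [1, 1, 1, 2, 2, 2]
```

Each user's rows form one contiguous run, which is exactly what keeps reads by userid cheap when all of a user's data lives in one split.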
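The two cardinality scenarios can also be sketched as a toy model. This is a back-of-the-envelope approximation (the function and its assumptions are mine, not Spanner's actual split-assignment algorithm): with far more users than nodes, a user's data fits in a single split, while with far fewer users than nodes, each user's key range must be carved into many splits:

```python
def splits_per_user(num_users: int, num_nodes: int) -> int:
    """Toy approximation: assume uniform load and roughly one split
    per node, with each split covering a contiguous primary-key range."""
    return max(1, num_nodes // num_users)

# Scenario 1: high-cardinality userid -- each user's interleaved rows
# stay together in a single split, so joins are local.
print(splits_per_user(1_000_000, 100))  # -> 1

# Scenario 2: low-cardinality userid -- one user's rows span many
# splits/nodes, so joining them back together is a distributed JOIN.
print(splits_per_user(8, 100))  # -> 12
```

The real split decisions are driven by data size and load, but the shape of the trade-off is the same: interleaving only guarantees locality while a user's data fits comfortably within a split.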