Search code examples
google-cloud-spanner

When forcing a query on a non-interleaved index(s), will there be any data locality benefit?


Assuming the following schema:

CREATE TABLE Foo (
    primaryId STRING(64) NOT NULL,
    secondaryId STRING(64) NOT NULL,
    extraData STRING(80),
    active BOOL NOT NULL
) PRIMARY KEY (primaryId, secondaryId);

CREATE TABLE Bar (
    primaryId STRING(64) NOT NULL,
    secondaryId STRING(64) NOT NULL,
    barId STRING(64) NOT NULL
) PRIMARY KEY (primaryId, secondaryId, barId),
INTERLEAVE IN PARENT Foo ON DELETE CASCADE;

CREATE TABLE Baz (
    primaryId STRING(64) NOT NULL,
    secondaryId STRING(64) NOT NULL,
    barId STRING(64) NOT NULL,
    bazId STRING(64) NOT NULL,
    extraData STRING(80)
) PRIMARY KEY (primaryId, secondaryId, barId, bazId),
INTERLEAVE IN PARENT Bar ON DELETE CASCADE;

CREATE INDEX foo_primaryId_active ON foo (primaryId, active);
CREATE INDEX baz_bazId ON Baz (bazId);

We have 3 tables Foo, Bar, Baz, where Bar is interleaved in Foo and Baz is interleaved in Bar. Along with 2 non-interleaved indexes.

Given the following query where we force the FROM and the JOIN onto the indexes; no explicit tables.

SELECT
    baz.primaryId, 
    baz.secondaryId, 
    baz.bazId, 
    baz.extraData
FROM
    Baz@{FORCE_INDEX=baz_bazId} AS baz
JOIN
    Foo@{FORCE_INDEX=foo_secondaryId_isActive} AS foo
ON
    foo.primaryId = baz.parimaryId AND foo.secondaryId = baz.secondaryId
WHERE
    baz.bazId = @bazId -- using the baz_bazId index to query on the bazId
    foo.active = true

Is there a data locality benefit in this query when forcing the indexes? If we were to later add a 4th table Zap and interleave that table on Foo:

CREATE TABLE Zap (
    primaryId STRING(64) NOT NULL,
    secondaryId STRING(64) NOT NULL,
    bazId STRING(64) NOT NULL,
    extraData STRING(80)
) PRIMARY KEY (primaryId, secondaryId, bazId),
INTERLEAVE IN PARENT Foo ON DELETE CASCADE;

CREATE INDEX zap_bazId ON Zap (bazId);

And adjust the above query to include a 3rd JOIN:

JOIN
    Zap@{FORCE_INDEX=zap_bazId} AS zap
ON 
    zap.bazId = @bazId AND zap.primaryId = foo.primaryId
WHERE
    baz.bazId = @bazId -- using the baz_bazId index to query on the bazId
    foo.active = true
    zap.extraData IS NULL

Would we obtain any data locality benefit here? Since we are querying on all non-interleaved indexes. Our zap.extraData IS NULL predicate is not stored on in the index itself, so that would likely need to run back to the Zap table to check.

If there is no data locality benefit from querying on non-interleaved indexes, could we just forgo that extra zap_bazId index and just alter the Zap table since we know we will be exclusively querying on the bazId for the data it hosts:

CREATE TABLE Zap (
    bazId STRING(64) NOT NULL,
    primaryId STRING(64) NOT NULL,
    secondaryId STRING(64) NOT NULL,
    extraData STRING(80)
) PRIMARY KEY (bazId, primaryId);

the amended query then becomes

JOIN
    Zap AS zap -- using a table; aka the implicit PRIMARY_KEY index
ON 
    zap.bazId = @bazId AND zap.primaryId = foo.primaryId
WHERE
    baz.bazId = @bazId AND -- using the baz_bazId index to query on the bazId
    foo.active = true AND
    zap.extraData IS NULL

Now, we lose the CASCADE DELETE here, so it might still just be worth it to interleave and create that additional index and STORING the zap.extraData into the index to save it from having to go back to the Zap table to pull that information.

The question would still be: Does data locality even come into play when querying/joining exclusively on non-interleaved indexes?


Solution

  • As I understand from the documentation, if the indexes are not interleaved and you query/join by the indexes, data locality doesn't matter. If you are intending to query using the indexes you should only need to interleave the indexes.

    Regardless, as you have explained you can keep using the interleaving on your tables if you are interested in the ON DELETE CASCADE statement, as it cannot be done without interleaving.

    Clarifications:

    Given a table with columns primaryId and secondaryId where the primary key of the table is primaryId. Creating a secondary index on secondaryId excludes it from being interleaved into the table.

    Yes.

    If the indexes are not interleaved, there is no data locality at play

    Depends on the query. Joins between a non-interleaved index and a base table are not local. You should consider the STORING clause in the index to avoid the join. Joins between a table and its parent would be local.

    The query explanation dashboard is a useful tool to show how Cloud Spanner executes specific queries. Using it we can analyze the aforementioned queries.

    • There's a distributed join between baz_bazId and Baz and another distributed join with foo_primaryId_active.

      SELECT
          baz.primaryId,
          baz.secondaryId,
          baz.bazId,
          baz.extraData
      FROM
          Baz@{FORCE_INDEX=baz_bazId} AS baz
      JOIN
          Foo@{FORCE_INDEX=foo_primaryId_active} AS foo
      ON
          foo.primaryId = baz.primaryId AND foo.secondaryId = baz.secondaryId
      WHERE
          baz.bazId = @bazId -- using the baz_bazId index to query on the bazId
          AND foo.active = true
      

      First query

    • There's an added distributed join between Zap and zap_bazid which is distributed joined with the rest.

      SELECT
          baz.primaryId,
          baz.secondaryId,
          baz.bazId,
          baz.extraData
      FROM
          Baz@{FORCE_INDEX=baz_bazId} AS baz
      JOIN
          Foo@{FORCE_INDEX=foo_primaryId_active} AS foo
      ON
          foo.primaryId = baz.primaryId AND foo.secondaryId = baz.secondaryId
      JOIN
          Zap@{FORCE_INDEX=zap_bazId} AS zap
      ON
          zap.bazId = @bazId AND zap.primaryId = foo.primaryId
      WHERE
          baz.bazId = @bazId -- using the baz_bazId index to query on the bazId
          AND foo.active = true
          AND zap.extraData IS NULL
      

      Second query

    • It uses table Zap2 (the non-interleaved version of Zap) instead of needing the distributed join between Zap and zap_bazid in the second query.

      SELECT
          baz.primaryId,
          baz.secondaryId,
          baz.bazId,
          baz.extraData
      FROM
          Baz@{FORCE_INDEX=baz_bazId} AS baz
      JOIN
          Foo@{FORCE_INDEX=foo_primaryId_active} AS foo
      ON
          foo.primaryId = baz.primaryId AND foo.secondaryId = baz.secondaryId
      JOIN
          Zap2 AS zap -- using a table; aka the implicit PRIMARY_KEY index
      ON
          zap.bazId = @bazId AND zap.primaryId = foo.primaryId
      WHERE
          baz.bazId = @bazId AND -- using the baz_bazId index to query on the bazId
          foo.active = true AND
          zap.extraData IS NULL
      

      Third query

    Spanner will handle all the related network I/O re: the data splits.

    Yes.

    If indexes can be interleaved there would be a benefit but the keys in those interleaved indexes have to be shared (like with any interleaved table). The docs for locality tradeoffs: "Focus on getting the desired locality for the most important root entities and most common access patterns, and let less frequent or less performance sensitive distributed operations happen when they need to."

    Yes.