Assuming the following schema:
CREATE TABLE Foo (
primaryId STRING(64) NOT NULL,
secondaryId STRING(64) NOT NULL,
extraData STRING(80),
active BOOL NOT NULL
) PRIMARY KEY (primaryId, secondaryId);
CREATE TABLE Bar (
primaryId STRING(64) NOT NULL,
secondaryId STRING(64) NOT NULL,
barId STRING(64) NOT NULL
) PRIMARY KEY (primaryId, secondaryId, barId),
INTERLEAVE IN PARENT Foo ON DELETE CASCADE;
CREATE TABLE Baz (
primaryId STRING(64) NOT NULL,
secondaryId STRING(64) NOT NULL,
barId STRING(64) NOT NULL,
bazId STRING(64) NOT NULL,
extraData STRING(80)
) PRIMARY KEY (primaryId, secondaryId, barId, bazId),
INTERLEAVE IN PARENT Bar ON DELETE CASCADE;
CREATE INDEX foo_primaryId_active ON foo (primaryId, active);
CREATE INDEX baz_bazId ON Baz (bazId);
We have 3 tables Foo, Bar, Baz, where Bar is interleaved in Foo and Baz is interleaved in Bar. Along with 2 non-interleaved indexes.
Given the following query where we force the FROM and the JOIN onto the indexes; no explicit tables.
SELECT
baz.primaryId,
baz.secondaryId,
baz.bazId,
baz.extraData
FROM
Baz@{FORCE_INDEX=baz_bazId} AS baz
JOIN
Foo@{FORCE_INDEX=foo_secondaryId_isActive} AS foo
ON
foo.primaryId = baz.parimaryId AND foo.secondaryId = baz.secondaryId
WHERE
baz.bazId = @bazId -- using the baz_bazId index to query on the bazId
foo.active = true
Is there a data locality benefit in this query when forcing the indexes? If we were to later add a 4th table Zap and interleave that table on Foo:
CREATE TABLE Zap (
primaryId STRING(64) NOT NULL,
secondaryId STRING(64) NOT NULL,
bazId STRING(64) NOT NULL,
extraData STRING(80)
) PRIMARY KEY (primaryId, secondaryId, bazId),
INTERLEAVE IN PARENT Foo ON DELETE CASCADE;
CREATE INDEX zap_bazId ON Zap (bazId);
And adjust the above query to include a 3rd JOIN:
JOIN
Zap@{FORCE_INDEX=zap_bazId} AS zap
ON
zap.bazId = @bazId AND zap.primaryId = foo.primaryId
WHERE
baz.bazId = @bazId -- using the baz_bazId index to query on the bazId
foo.active = true
zap.extraData IS NULL
Would we obtain any data locality benefit here? Since we are querying on all non-interleaved indexes. Our zap.extraData IS NULL
predicate is not stored on in the index itself, so that would likely need to run back to the Zap table to check.
If there is no data locality benefit from querying on non-interleaved indexes, could we just forgo that extra zap_bazId
index and just alter the Zap table since we know we will be exclusively querying on the bazId for the data it hosts:
CREATE TABLE Zap (
bazId STRING(64) NOT NULL,
primaryId STRING(64) NOT NULL,
secondaryId STRING(64) NOT NULL,
extraData STRING(80)
) PRIMARY KEY (bazId, primaryId);
the amended query then becomes
JOIN
Zap AS zap -- using a table; aka the implicit PRIMARY_KEY index
ON
zap.bazId = @bazId AND zap.primaryId = foo.primaryId
WHERE
baz.bazId = @bazId AND -- using the baz_bazId index to query on the bazId
foo.active = true AND
zap.extraData IS NULL
Now, we lose the CASCADE DELETE here, so it might still just be worth it to interleave and create that additional index and STORING the zap.extraData
into the index to save it from having to go back to the Zap table to pull that information.
The question would still be: Does data locality even come into play when querying/joining exclusively on non-interleaved indexes?
As I understand from the documentation, if the indexes are not interleaved and you query/join by the indexes, data locality doesn't matter. If you are intending to query using the indexes you should only need to interleave the indexes.
Regardless, as you have explained you can keep using the interleaving on your tables if you are interested in the ON DELETE CASCADE
statement, as it cannot be done without interleaving.
Given a table with columns
primaryId
andsecondaryId
where the primary key of the table isprimaryId
. Creating a secondary index onsecondaryId
excludes it from being interleaved into the table.
Yes.
If the indexes are not interleaved, there is no data locality at play
Depends on the query. Joins between a non-interleaved index and a base table are not local. You should consider the STORING
clause in the index to avoid the join. Joins between a table and its parent would be local.
The query explanation dashboard is a useful tool to show how Cloud Spanner executes specific queries. Using it we can analyze the aforementioned queries.
There's a distributed join between baz_bazId
and Baz
and another distributed join with foo_primaryId_active
.
SELECT
baz.primaryId,
baz.secondaryId,
baz.bazId,
baz.extraData
FROM
Baz@{FORCE_INDEX=baz_bazId} AS baz
JOIN
Foo@{FORCE_INDEX=foo_primaryId_active} AS foo
ON
foo.primaryId = baz.primaryId AND foo.secondaryId = baz.secondaryId
WHERE
baz.bazId = @bazId -- using the baz_bazId index to query on the bazId
AND foo.active = true
There's an added distributed join between Zap
and zap_bazid
which is distributed joined with the rest.
SELECT
baz.primaryId,
baz.secondaryId,
baz.bazId,
baz.extraData
FROM
Baz@{FORCE_INDEX=baz_bazId} AS baz
JOIN
Foo@{FORCE_INDEX=foo_primaryId_active} AS foo
ON
foo.primaryId = baz.primaryId AND foo.secondaryId = baz.secondaryId
JOIN
Zap@{FORCE_INDEX=zap_bazId} AS zap
ON
zap.bazId = @bazId AND zap.primaryId = foo.primaryId
WHERE
baz.bazId = @bazId -- using the baz_bazId index to query on the bazId
AND foo.active = true
AND zap.extraData IS NULL
It uses table Zap2
(the non-interleaved version of Zap
) instead of needing the distributed join between Zap
and zap_bazid
in the second query.
SELECT
baz.primaryId,
baz.secondaryId,
baz.bazId,
baz.extraData
FROM
Baz@{FORCE_INDEX=baz_bazId} AS baz
JOIN
Foo@{FORCE_INDEX=foo_primaryId_active} AS foo
ON
foo.primaryId = baz.primaryId AND foo.secondaryId = baz.secondaryId
JOIN
Zap2 AS zap -- using a table; aka the implicit PRIMARY_KEY index
ON
zap.bazId = @bazId AND zap.primaryId = foo.primaryId
WHERE
baz.bazId = @bazId AND -- using the baz_bazId index to query on the bazId
foo.active = true AND
zap.extraData IS NULL
Spanner will handle all the related network I/O re: the data splits.
Yes.
If indexes can be interleaved there would be a benefit but the keys in those interleaved indexes have to be shared (like with any interleaved table). The docs for locality tradeoffs: "Focus on getting the desired locality for the most important root entities and most common access patterns, and let less frequent or less performance sensitive distributed operations happen when they need to."
Yes.