c#google-cloud-platform google-cloud-spanner

Investigating Google Cloud Spanner Latency - are secondary indexes locking?

I have an application which inserts hundreds of thousands of rows (each with only 3 columns) to spanner using the following code ran concurrently with batches of 5000.

public async Task ExecuteBatchInsertOrReplaceAsync<T>(List<T> items, SpannerConnection connection)
{
    // This will throw if the items count * column > 20,000. In this case, batch the batches.
    await connection.RunWithRetriableTransactionAsync(async transaction =>
    {
        await Task.WhenAll(items.Select(item => ExecuteInsertOrReplaceAsync(item, connection, transaction)));
    });
    
    Logger.LogInformation($"ExecuteBatchInsertOrReplaceAsync executed on {items.Count} items.");
}

public async Task<int> ExecuteInsertOrReplaceAsync<T>(T item, SpannerConnection connection, SpannerTransaction spannerTransaction = null)
{
    var parameters = new SpannerParameterCollection().CreateKeys<T>();
    parameters.PopulateFrom(item);

    await using var command = connection.CreateInsertOrUpdateCommand(TableName, parameters);
    command.Transaction = spannerTransaction;

    var count = await command.ExecuteNonQueryAsync();
    return count;
}

But when executed spanner runs with latency, making the writes take more time than I'd like. Spanner monitoring shows I have a latency of around 40s. My write throughput is about 14MiB/s using 5 pods.

The table I'm inserting to has a single unique index. The Spanner docs suggest that high latency can be the result of table locking. Checking Spanner's lock stats with

SELECT CAST(s.row_range_start_key AS STRING) AS row_range_start_key,
   t.total_lock_wait_seconds,
   s.lock_wait_seconds,
   s.lock_wait_seconds/t.total_lock_wait_seconds frac_of_total,
   s.sample_lock_requests,
   t.interval_end
FROM spanner_sys.lock_stats_total_10minute t, spanner_sys.lock_stats_top_10minute s
WHERE
  t.interval_end = "2022-03-04T16:00:00Z"

shows me that there are indeed many locks that are being awaited for several seconds, each with sample_lock_requests = _Index_ix_my_index_name._exists,Exclusive.

So here is my question: is spanner slowing down my writes because my unique secondary index is locking the table for each write, or could the latency be caused by something else? If I'm missing any key information, my apologies, please let me know.

Thanks

Solution

is spanner slowing down my writes because my unique secondary index is locking the table for each write

Please note that Spanner does not lock the table, the lock granularity is row-and-column, or cell. More details about locking: https://cloud.google.com/spanner/docs/transactions#locking

or could the latency be caused by something else?

It is often hard to conclude what caused high latency in Spanner without knowing more details of your workload: schema, key ranges you are inserting, is your database currently empty or already have data, if have data how they are distributed, etc. Generally speaking, unique index constraint will be validated at commit time, thus if you are not inserting a spread-out key range, there would be lock contentions. But it is hard to conclude whether the 40s latency are fully contributed by this factor.

This page https://cloud.google.com/spanner/docs/bulk-loading have some information on best practices on bulk loading. If your database is currently empty and you are doing one time bulk loading, it would be faster if you delete the constraint and add it back after data load.

If you are inserting into a non-empty table, try to use a smaller mutation size per transaction, which might help.

You could also use Key visualizer https://cloud.google.com/spanner/docs/key-visualizer to see whether your inserts cause hot spots or rolling hot spots, which usually contribute to high latency.

Please feel free to file a service ticket if you need more detailed help.