Tags: latency, google-cloud-spanner, observability

I got an alert for a latency spike in my Cloud Spanner database. How can I debug it?


I am using Cloud Spanner as the database for my application. The latency has spiked for certain queries, and I would like to know how to debug it.

I tried looking at the System Insights page but couldn't correlate it with what's happening in my application.


Solution

  • The System Insights page has a lot of useful information. Check it for the problematic instance. Typically, high latency is caused by one of the following:

    1. Underprovisioned Instance: For high-priority CPU, Cloud Spanner recommends a utilization threshold of 65% for single-region setups and 45% for multi-region setups. Start by determining whether the CPU utilization is driven by user or system load. If the CPU utilization is consistently higher than the recommended threshold, the first action is to increase the number of nodes. This distributes the traffic across more nodes, allowing Cloud Spanner to deliver better latency. Adding more nodes will bring the latency down if it is due to resource constraints (a node-count sketch with the Python client follows this list).
    2. Expensive/Inefficient Queries: Expensive or inefficient queries running on a database may lead to higher latencies. Check the Query Insights dashboard for the most expensive queries, select one, and see on the Query Details dashboard whether the ratio of rows scanned to rows returned is high. If it is, the query can probably be optimized further. The Query Details page also shows Sampled Query Plans, which keep a history of query plans and can help find out what caused the higher latencies for the query (see the query statistics sketch after this list).
    3. Lock Contention: To keep multiple concurrent transactions consistent, Spanner uses locks to control access to the data. Lock contention occurs when many transactions need frequent access to the same lock, leading to high latencies for reads and writes. The lock wait time chart in System Insights shows the lock wait time for the instance/database; higher lock wait times mean higher request latencies. Lock contention can be debugged using Lock and Transaction Insights (see the lock statistics sketch after this list).
    4. Hotspotting: Check whether there are any hotspots in the database, which can be verified using Key Visualizer. If there are, refer to the Schema Design Best Practices to avoid them (an example key scheme is sketched after this list).
    5. Large Transactions: Large transactions with many participants (splits) can lead to higher latencies. Data needs to be read from and written to each participant in the transaction, so a higher participant count means more network overhead for coordination and more processing within the same transaction. The participant count for transactions can be found on the Transaction Details page of Transaction Insights (see the transaction statistics sketch after this list).

    Request latencies may also be high due to session misconfiguration on the client side. Client libraries use sessions to run queries on Cloud Spanner, and a misconfigured session pool can cause high latency because creating a session is expensive. A general guideline is to set min_sessions equal to the maximum write QPS expected during the load for better latency numbers (see the session pool sketch below).
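
Sketch for point 1, using the google-cloud-spanner Python client: read the current node count through the instance admin API and raise it by one. The instance ID is a placeholder and the flow is my assumption of the usual admin pattern; adapt it to your environment and scaling policy.

    from google.cloud import spanner
    from google.protobuf import field_mask_pb2

    client = spanner.Client()
    instance_admin_api = client.instance_admin_api

    # Fetch the current instance definition, including its node count.
    name = instance_admin_api.instance_path(client.project, "demo-instance")
    instance = instance_admin_api.get_instance(name=name)
    print("Current node count:", instance.node_count)

    # If high-priority CPU stays above the recommended threshold, add a node
    # and wait for the long-running resize operation to complete.
    instance.node_count += 1
    operation = instance_admin_api.update_instance(
        instance=instance,
        field_mask=field_mask_pb2.FieldMask(paths=["node_count"]),
    )
    operation.result(timeout=300)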
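
Sketch for point 2: the numbers behind Query Insights are also exposed in the built-in SPANNER_SYS query statistics tables, so the scanned-vs-returned check can be scripted. The instance/database IDs are placeholders and the column names follow the documented QUERY_STATS_TOP_* schema as I understand it.

    from google.cloud import spanner

    client = spanner.Client()
    database = client.instance("demo-instance").database("demo-database")

    sql = """
        SELECT text,
               execution_count,
               avg_latency_seconds,
               avg_rows,
               avg_rows_scanned
        FROM spanner_sys.query_stats_top_10minute
        ORDER BY avg_latency_seconds DESC
        LIMIT 10
    """

    with database.snapshot() as snapshot:
        for text, count, latency, rows, rows_scanned in snapshot.execute_sql(sql):
            # A high scanned-to-returned ratio hints the query can be optimized.
            ratio = rows_scanned / rows if rows else float("inf")
            print(f"{latency:.3f}s avg, ratio={ratio:.1f}, runs={count}: {text[:80]}")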
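
Sketch for point 3: lock wait statistics can likewise be queried from the SPANNER_SYS.LOCK_STATS_TOP_* tables to find the row ranges behind the contention. IDs are placeholders; the columns follow the documented schema as I understand it.

    from google.cloud import spanner

    client = spanner.Client()
    database = client.instance("demo-instance").database("demo-database")

    sql = """
        SELECT CAST(row_range_start_key AS STRING) AS start_key,
               lock_wait_seconds
        FROM spanner_sys.lock_stats_top_10minute
        ORDER BY lock_wait_seconds DESC
        LIMIT 10
    """

    with database.snapshot() as snapshot:
        for start_key, wait_seconds in snapshot.execute_sql(sql):
            # The row ranges with the longest waits point at the contended rows.
            print(f"{wait_seconds:.3f}s of lock waits starting at {start_key}")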
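
Sketch for point 4: one of the schema design best practices is to avoid monotonically increasing primary keys (timestamps, sequence numbers) so that writes spread across splits instead of hitting one hot split. The table and column names below are made up for illustration.

    import uuid

    from google.cloud import spanner

    client = spanner.Client()
    database = client.instance("demo-instance").database("demo-database")


    def insert_event(transaction):
        transaction.insert(
            table="Events",                     # hypothetical table
            columns=("EventId", "Payload"),
            # A UUIDv4 key spreads inserts across the key space instead of
            # piling them onto the split that owns the latest value.
            values=[(str(uuid.uuid4()), "example-payload")],
        )


    database.run_in_transaction(insert_event)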
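
Sketch for point 5: the participant count is also surfaced in the SPANNER_SYS transaction statistics tables, which makes it easy to rank transaction shapes by how many splits they touch. IDs are placeholders; the columns follow the documented TXN_STATS_TOP_* schema as I understand it.

    from google.cloud import spanner

    client = spanner.Client()
    database = client.instance("demo-instance").database("demo-database")

    sql = """
        SELECT fprint,
               avg_participants,
               avg_total_latency_seconds,
               commit_attempt_count
        FROM spanner_sys.txn_stats_top_10minute
        ORDER BY avg_participants DESC
        LIMIT 10
    """

    with database.snapshot() as snapshot:
        for fprint, participants, latency, attempts in snapshot.execute_sql(sql):
            print(f"txn {fprint}: ~{participants:.1f} participants, "
                  f"{latency:.3f}s avg latency, {attempts} commit attempts")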
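
For the session configuration point: min_sessions/max_sessions are options in clients such as Java and Go, while the Python client expresses the same idea through a session pool. The sketch below pre-creates a fixed pool of sessions so requests at peak load reuse warm sessions instead of paying the session-creation cost. The pool size of 100 is an assumed value; size it to your expected write QPS.

    from google.cloud import spanner
    from google.cloud.spanner_v1.pool import FixedSizePool

    client = spanner.Client()
    instance = client.instance("demo-instance")

    # A fixed-size pool creates all of its sessions when it is bound to the
    # database, so steady-state requests do not create sessions on the fly.
    pool = FixedSizePool(size=100, default_timeout=10)
    database = instance.database("demo-database", pool=pool)

    with database.snapshot() as snapshot:
        print(list(snapshot.execute_sql("SELECT 1")))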