Search code examples
gogoogle-cloud-spanner

Why is spanner Query using ReadOnlyTransaction in golang goroutines incrementally slow


I am trying to query approx 10,000 rows from a table. After trying various other options involving limit offset and not finding desired success, I am trying to query single row in each goroutine. Thought processes being that each row only takes ~5ms for query and fetch but a batch of 10k takes over 20s.

Shown below is simplified version of the code:

func queryEmp(IDs[]string, spannerClient *spanner.Client) (Employee,error){
query := "Select name from Employee Where id = @id"

    g, gCtx := errgroup.WithContext(ctx)
    for _, ID := range IDs {
        id := ID
        g.Go(func() error {
    
            tx := spannerClient.Single() 
            defer tx.Close()

            stmt2 := spanner.NewStatement(query)
            stmt2.Params = map[string]interface{}{
                "ID": id,
            }

            qstart := time.Now()
            it := tx.Query(gCtx, stmt2)
            defer it.Stop()
            logrus.Debugf("%s took %v \n", "query execution.", time.Since(qstart))

            for {
                row, err := it.Next()
                if err == iterator.Done {
                    break
                }
                if err != nil {
                    return err
                }

                var eID string
                if err := row.Column(0, &eID); err != nil {
                    return err
                }

            }

            return nil
        })
    }
    err = g.Wait()
}

The resulting trace is as below:

{"message":"query execution. took 39.677µs \n","severity":"debug","time":"2023-11-03T20:51:29-04:00"}
{"message":"query execution. took 34.125µs \n","severity":"debug","time":"2023-11-03T20:51:29-04:00"}
{"message":"query execution. took 26.634µs \n","severity":"debug","time":"2023-11-03T20:51:29-04:00"}
{"message":"query execution. took 29.303µs \n","severity":"debug","time":"2023-11-03T20:51:29-04:00"}
...
...
...
{"message":"query execution. took 188.749562ms \n","severity":"debug","time":"2023-11-03T20:51:29-04:00"}
{"message":"query execution. took 276.424692ms \n","severity":"debug","time":"2023-11-03T20:51:29-04:00"}
{"message":"query execution. took 188.62849ms \n","severity":"debug","time":"2023-11-03T20:51:29-04:00"}
{"message":"query execution. took 217.067524ms \n","severity":"debug","time":"2023-11-03T20:51:29-04:00"}
{"message":"query execution. took 276.949166ms \n","severity":"debug","time":"2023-11-03T20:51:29-04:00"}
...
...
...
{"message":"query execution. took 454.64281ms \n","severity":"debug","time":"2023-11-03T20:51:29-04:00"}
{"message":"query execution. took 452.0848ms \n","severity":"debug","time":"2023-11-03T20:51:29-04:00"}
{"message":"query execution. took 525.748738ms \n","severity":"debug","time":"2023-11-03T20:51:29-04:00"}
{"message":"query execution. took 454.704656ms \n","severity":"debug","time":"2023-11-03T20:51:29-04:00"}
{"message":"query execution. took 455.4276ms \n","severity":"debug","time":"2023-11-03T20:51:29-04:00"}
...
...
...
{"message":"query execution. took 6.767574136s \n","severity":"debug","time":"2023-11-03T20:52:00-04:00"}
{"message":"query execution. took 6.780578444s \n","severity":"debug","time":"2023-11-03T20:52:00-04:00"}
{"message":"query execution. took 6.785085491s \n","severity":"debug","time":"2023-11-03T20:52:00-04:00"}
{"message":"query execution. took 6.779527006s \n","severity":"debug","time":"2023-11-03T20:52:00-04:00"}

It starts out great and as expected but the query time keeps increasing.

MaxSessions and MinSessions for spannerClient are 100 so one would imagine that it may see slight slowdown after 100 but that isn't the case.

Read here that:

Sessions can execute only one transaction at a time. Standalone reads, writes, and queries use a transaction internally, and count toward the one transaction limit.

Non iterative query (ReadRow etc) gives me same result.

Having tx := spannerClient.Single() outside of for loop also gives similar result.

Question(s):

  1. Does this mean the goroutines are trying to use same session/transaction despite doing spannerClient.Single() in the goroutine?
  2. How can I modify above to fix the issue?

Solution

  • TLDR: The default max session pool size is 400, meaning that you will never have more than 400 queries running in parallel. You would need to increase the session pool size to achieve this kind of concurrency.


    First of all: I do not think that sending 10,000 queries in parallel to let each read a single row is going to be the most efficient solution to your problem. If there is no other condition that you can use to filter than the employee ID, and those IDs are scattered all over the place, then it would still be more efficient to create a query in the form

    select * from employees where id in unnest(@ids)
    

    See this comment for a complete example: https://github.com/googleapis/google-cloud-go/issues/858#issuecomment-550982307


    Coming back to your concrete question:

    1. You are actually not measuring the time that it takes to execute the query. It is a bit confusing, but the it := tx.Query(gCtx, stmt2) lines does not execute the query, it only prepares the query for execution. The execution happens the first time that you call row, err := it.Next(). You can also see this in the execution times that are being logged. The first statements seem to be executed in 30 microseconds, which is impossible.
    2. This means that it is something on your client that is limiting your progress, and in this case I am pretty sure that it is the max size of your session pool. The default max session pool size is 400. That means that you will have at most 400 queries running in parallel. The increasing waiting times that you are seeing are goroutines being placed in a waiter queue waiting for a session to become available. The waiting times will be longer for the goroutines at the end of the queue.