Tags: spring-boot, kubernetes-ingress, load-testing, gatling, latency

Reason for peculiar / descending triangle latency spike pattern during load tests


I am having a hard time identifying the underlying issue behind the following latency pattern for the max percentile of my application: [Gatling chart: max-percentile latency over the test run]

This is a Gatling chart showing 4 minutes of load testing. The first two minutes are a warmup of the same scenario (that's why no latency is plotted for them).

Two triangles (sometimes more) with a nearly identical slope are clearly visible and reproducible across multiple test runs, no matter how many application instances we deploy behind our load balancer: [chart: repeated triangle-shaped latency spikes]

I am looking for more paths to investigate, as I am having a hard time googling for this pattern. It strikes me as particularly odd that the triangle is not "filled" but consists only of spikes. Furthermore, the triangle feels "inverted": if this were a scenario with ever-increasing load (which it isn't), I would expect this kind of triangle to manifest with the opposite slope - the slope as it stands just doesn't make sense to me.

Technical context:

  • This is for a Spring Boot application with a PostgreSQL database in AWS
  • There are 6 pods deployed in our Kubernetes cluster, auto-scaling was disabled for this test
  • Keep-alive is used by our Gatling test (see the solution below - it turned out this was not actually the case; see the sketch after this list)
  • The Kubernetes ingress configuration is left at its defaults, which, if I read them correctly, implies keep-alive connections to each upstream
  • Neither the database nor the per-pod CPU is maxed out
  • The network uplink of our load testing machine is not maxed out and the machine does nothing else besides running the load test
  • The load (requests/sec) on the application is nearly constant and does not change after the warmup or during the measurement window
  • Garbage collection activity is low
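For the keep-alive point above, here is a minimal sketch of a Gatling simulation in the Kotlin DSL that forces connection reuse. The base URL, endpoint name, and request rate are made up for illustration; `shareConnections()` is the Gatling protocol option that lets all virtual users draw from a single connection pool, so an open workload model does not pay for a fresh TCP/TLS handshake on every new virtual user.

```kotlin
import io.gatling.javaapi.core.*
import io.gatling.javaapi.core.CoreDsl.*
import io.gatling.javaapi.http.*
import io.gatling.javaapi.http.HttpDsl.*
import java.time.Duration

class KeepAliveLoadTest : Simulation() {

    // shareConnections() pools connections across virtual users, so new virtual
    // users reuse existing TCP/TLS connections instead of handshaking each time.
    private val httpProtocol: HttpProtocolBuilder = http
        .baseUrl("https://my-service.example.com") // hypothetical target
        .shareConnections()

    private val scn: ScenarioBuilder = scenario("constant load")
        .exec(http("get endpoint").get("/api/example")) // hypothetical endpoint

    init {
        setUp(
            scn.injectOpen(
                // 2 minutes of warmup followed by 2 minutes of measurement,
                // at a made-up constant rate of 100 new users per second.
                constantUsersPerSec(100.0).during(Duration.ofMinutes(4))
            )
        ).protocols(httpProtocol)
    }
}
```

Note that Gatling keeps connections alive per virtual user by default, so in an open model where each virtual user issues only one request, every request still opens a new connection unless connections are shared like this.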

Here is another image to demonstrate the "triangle" before we made some application-side optimizations to request latency: [chart: triangle pattern prior to the optimizations]


Solution

  • This turned out to be a two-part issue:

    • we thought our load test was using keep-alive connections, but it wasn't (TLS handshakes are expensive, and ephemeral ports run out after some time)
    • a custom priority-based task scheduling system (an earlier request and its subtasks have higher priority than later requests) "lost" its task priority because of how Kotlin coroutines work: thread A gets suspended during a coroutine and another thread picks up the remaining work later, losing any thread-local priority - this can be fixed via asContextElement() (see the sketch below)

    While this does not explain the more-than-peculiar shape of the latency pattern, it did resolve the main issues we had, and the pattern is gone.
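To illustrate the second part of the fix, here is a minimal, self-contained sketch (the names requestPriority, handleRequest, and doSubtask are hypothetical, not the actual application code) showing how a ThreadLocal-backed priority is lost across a suspension point unless it is propagated into the coroutine context with kotlinx.coroutines' asContextElement():

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.asContextElement
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withContext

// Hypothetical thread-local carrying the scheduling priority of the current request.
val requestPriority: ThreadLocal<Int> = ThreadLocal.withInitial { Int.MAX_VALUE }

suspend fun doSubtask() {
    delay(10) // suspension point: the coroutine may resume on a different thread
    // Without asContextElement(), the resuming thread would only see the default
    // value, i.e. the subtask would silently lose its priority.
    println("subtask resumed with priority ${requestPriority.get()}")
}

suspend fun handleRequest(priority: Int) {
    // Propagate the thread-local into the coroutine context so its value is
    // restored on whichever thread the coroutine resumes on after suspending.
    withContext(Dispatchers.Default + requestPriority.asContextElement(value = priority)) {
        doSubtask()
    }
}

fun main() = runBlocking {
    handleRequest(priority = 1) // prints "subtask resumed with priority 1"
}
```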