Search code examples
nettyspring-webfluxgraphql-java

Spring Webflux unaccounted time loss


We have recently switched to ExpediaGroups GraphQL library which is based on Spring Webflux.

Since switching our Jaeger Traces show gaps before and after the last database query / span is created: enter image description here There is no computation heavy work done before or after performing the aforementinoned database queries besides trivial entity -> DTO mapping. An initial investigation via VisualVM has shown no obvious hotspots but we are losing an overall of 3-6ms in a local environment (dev, prod suffer even more) which increases the overall response time by 100%.

We are at a loss where to go from here: Is it a WebFlux issue? Is it a issue in regards to the linked library? We are executing all of our logic in a separate ThreadPool which is not saturated (these results are present for one client non-concurrently calling our GraphQL endpoint) so we shouldn't block the event loop of Netty (even if we did: it shouldn't create those "gaps" if my understanding of Webflux is correct).

I am looking for a way to further investigate this issue or any configuration knobs.

The "gaps" in between the database calls have been identified to be framework related and can be circumvented by restructuring our code, the head and tail "gaps" cannot be accounted for in profilers / worked around. Furthermore we are not losing any tracing related information across thread boundaries, that has been accounted for.

Additional information:

  • Our response size on average is below 1 kb
  • There is no reverse proxy in front of this service in a local environment which also exhibits this issue
  • All traffic is HTTP, not HTTPS

Solution

  • After a lot of debugging we have found the following answers:

    • The "gaps" in front and at the end of each request is just pure I/O of the incoming requests and the corresponding response. This also explains the bigger gaps in production when the requestee is slowly consuming the response / sending the request which doesn't hold true for localhost requests
    • The larger gaps in between can be attributed to lacking warmup (JIT) which seems necessary for the WebFlux stack
    • The overall "bad" performance is also warmup related (JIT)

    Furthermore the suggested PreparsedDocumentProvider GraphQL caching (thank you Korashen) seems to be giving a decent performance uplift:

        @Bean
        fun preparsedDocumentProvider(): PreparsedDocumentProvider = object : PreparsedDocumentProvider {
            @Suppress("MagicNumber")
            private val cache: Cache<String, PreparsedDocumentEntry> = Cache2kBuilder
                .of(String::class.java, PreparsedDocumentEntry::class.java)
                // avoid cache attacks
                .entryCapacity(1024)
                .eternal(true)
                .build()
    
            override fun getDocument(
                executionInput: ExecutionInput,
                parseAndValidateFunction: Function<ExecutionInput, PreparsedDocumentEntry>
            ) = cache.computeIfAbsent(executionInput.query) { parseAndValidateFunction.apply(executionInput) }
        }