I have a service written with webflux that has high load (40 request per second)
and I'm encountering a really bad latency and performance issues with behaviours I can't explain: at some point during peaks, the service hangs in random locations as if it doesn't have any threads to handle the request.
The service does however have several calls to different service that aren't reactive - using WebClient
, and another call to a main service that retrieves the main data through an sdk wrapped in Mono.fromCallable(..).publishOn(Schedulers.boundedElastic())
.
So the flow is:
Mono<Request>
Mono<RequestAggregator>
webclient
Mono.fromCallable(MainService.getData(RequestAggregator)).publishOn(Schedulers.boundedElastic())
Mono<Response>
the webclient calls look something like that:
Mono.fromCallable(() -> GoogleService.getToken(account, clientId)
.buildIapRequest(REQUEST_URL))
.map(httpRequest -> httpRequest.getHeaders().getAuthorization())
.flatMap(authToken -> webClient.post()
.uri("/call/some/endpoint")
.header(HttpHeaders.AUTHORIZATION, authToken)
.header(HttpHeaders.CONTENT_TYPE, MediaType.APPLICATION_JSON_VALUE)
.header(HttpHeaders.ACCEPT, MediaType.APPLICATION_JSON_VALUE)
.body(BodyInserters.fromValue(countries))
.retrieve()
.onStatus(HttpStatus::isError, clientResponse -> {
log.error("{} got status code: {}",
ERROR_MSG_ERROR, clientResponse.statusCode());
return Mono.error(new SomeWebClientException(STATE_ABBREVIATIONS_ERROR));
})
.bodyToMono(SomeData.class));
sometimes step 6 hangs for more than 11 minutes, and this service does not have any issues. It's not reactive but responses take ~400ms
Another thing worth mentioning is that MainService
is a heavy IO operation that might take 1 minute or more.
I feel like a lot of request hangs on MainService
and theren't any threads left for the other operations, does that make sense? if so, how does one solve something like that?
Can someone suggest any reason for this issue? I'm all out of ideas
It's not possible to tell for sure without knowing the full application, but indeed the blocking IO operation is the most likely culprit.
Schedulers.boundedElastic()
, as its name suggests, is bounded. By default the bound is "ten times the number of available CPU cores", so on a 2-core machine it would be 20. If you have more concurrent requests than the limit, the rest is put into a queue waiting for a free thread indefinitely. If you need more concurrency than that, you should consider setting up your own scheduler using Scheduler.fromExecutor
with a higher limit.