spring-boot microservices cqrs event-sourcing axon

Scaling an Axon application - command handling load tests failing

I've created an Axon application with two Spring Boot services, hotel-booking-command and hotel-booking-query, for the command side and query side respectively. These services are loosely based on the sample application provided by AxonIQ. I'm using Axon Server as the event store and message router. The services sit behind Spring Cloud Gateway, and I'm using Consul for service discovery. Everything seems to work fine as long as I run only one instance of the command-side application. When I run two or more instances and the load gets higher, the connection to Axon Server is lost on all instances:

2022-06-26 17:00:37.675  INFO 86356 --- [ctor-http-nio-4] o.a.m.interceptors.LoggingInterceptor    : [AddRoomCommand] executed successfully with a [Integer] return value
2022-06-26 17:00:37.675  INFO 86356 --- [ctor-http-nio-4] o.a.m.interceptors.LoggingInterceptor    : Dispatched messages: [RoomAddedEvent]
2022-06-26 17:01:10.258  INFO 86356 --- [SQ-P024.local-0] i.a.a.c.impl.AxonServerManagedChannel    : Unable to recover current connection to AxonServer. Attempting to reconnect...
2022-06-26 17:01:10.264  INFO 86356 --- [SQ-P024.local-0] i.a.a.c.impl.AxonServerManagedChannel    : Requesting connection details from localhost:8124
2022-06-26 17:01:15.272  WARN 86356 --- [SQ-P024.local-0] i.a.a.c.impl.AxonServerManagedChannel    : Connecting to AxonServer node [localhost:8124] failed.

io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 4.997735389s. [closed=[], open=[[buffered_nanos=4998488350, waiting_for_connection]]]
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262) ~[grpc-stub-1.43.0.jar:1.43.0]
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243) ~[grpc-stub-1.43.0.jar:1.43.0]
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156) ~[grpc-stub-1.43.0.jar:1.43.0]
    at io.axoniq.axonserver.grpc.control.PlatformServiceGrpc$PlatformServiceBlockingStub.getPlatformServer(PlatformServiceGrpc.java:250) ~[axonserver-connector-java-4.5.4.jar:4.5.4]
    at io.axoniq.axonserver.connector.impl.AxonServerManagedChannel.connectChannel(AxonServerManagedChannel.java:115) ~[axonserver-connector-java-4.5.4.jar:4.5.4]
    at io.axoniq.axonserver.connector.impl.AxonServerManagedChannel.createConnection(AxonServerManagedChannel.java:335) ~[axonserver-connector-java-4.5.4.jar:4.5.4]
    at io.axoniq.axonserver.connector.impl.AxonServerManagedChannel.ensureConnected(AxonServerManagedChannel.java:308) ~[axonserver-connector-java-4.5.4.jar:4.5.4]
    at io.axoniq.axonserver.connector.impl.AxonServerManagedChannel.lambda$scheduleConnectionCheck$4(AxonServerManagedChannel.java:378) ~[axonserver-connector-java-4.5.4.jar:4.5.4]
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[na:na]
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[na:na]
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[na:na]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[na:na]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[na:na]
    at java.base/java.lang.Thread.run(Thread.java:833) ~[na:na]

2022-06-26 17:01:15.272  INFO 86356 --- [SQ-P024.local-0] i.a.a.c.impl.AxonServerManagedChannel    : Failed to get connection to AxonServer. Scheduling a reconnect in 2000ms
2022-06-26 17:01:15.272  INFO 86356 --- [SQ-P024.local-1] i.a.a.c.impl.AxonServerManagedChannel    : Connection to AxonServer lost. Attempting to reconnect...
2022-06-26 17:01:15.273  INFO 86356 --- [SQ-P024.local-1] i.a.a.c.impl.AxonServerManagedChannel    : Requesting connection details from localhost:8124
2022-06-26 17:01:20.275  WARN 86356 --- [SQ-P024.local-1] i.a.a.c.impl.AxonServerManagedChannel    : Connecting to AxonServer node [localhost:8124] failed: DEADLINE_EXCEEDED: deadline exceeded after 4.988964197s. [closed=[], open=[[buffered_nanos=4989266439, waiting_for_connection]]]
2022-06-26 17:01:20.275  INFO 86356 --- [SQ-P024.local-1] i.a.a.c.impl.AxonServerManagedChannel    : Failed to get connection to AxonServer. Scheduling a reconnect in 2000ms

The logs from Gatling very quickly start to look like this (the simulation executed 250 requests per second; the code of the Gatling simulation is available here):

for HTTP POST "/api/hotel-booking/command/rooms"

io.netty.channel.AbstractChannel$AnnotatedConnectException: Operation timed out: /192.168.0.12:8082
    Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException: 
Error has been observed at the following site(s):
    *__checkpoint ⇢ org.springframework.cloud.gateway.filter.WeightCalculatorWebFilter [DefaultWebFilterChain]
    *__checkpoint ⇢ org.springframework.boot.actuate.metrics.web.reactive.server.MetricsWebFilter [DefaultWebFilterChain]
    *__checkpoint ⇢ HTTP POST "/api/hotel-booking/command/rooms" [ExceptionHandlingWebHandler]
Original Stack Trace:
Caused by: java.net.ConnectException: Operation timed out
    at java.base/sun.nio.ch.Net.pollConnect(Native Method) ~[na:na]
    at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[na:na]
    at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[na:na]

Usually, only the first 500 or 1000 requests are handled correctly.

The current version of the application (including the load tests, located in the gatling module) is available here: https://github.com/a-glapinski/event-sourcing-and-cqrs-jvm/tree/api-testing. I'm willing to provide more details about the application if needed; right now I'm not sure what could be wrong or where to look.

Is there something potentially missing in the configuration of my Axon command-side application? Should I change some configuration in Axon Server? Or maybe the concept of the whole system I've created is wrong in the context of Axon Framework, and my application won't be able to scale at all?

Solution

Thanks for sharing your project. It is an amazing initiative!

I have a couple of concerns about this design on the strategic level:

1. implementation("org.axonframework:axon-spring-boot-starter") and implementation("org.axonframework.extensions.springcloud:axon-springcloud-spring-boot-starter") are used together to distribute commands. There is no need for this. Axon Server already acts as a service registry and discovery mechanism for your commands, and it will route commands to the appropriate command handlers (much better than Consul).
2. If you really need to use Consul/Eureka for service discovery, my advice is to limit it to discovering web components/controllers, not Axon command handlers (let Axon Server do that).
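
As a rough illustration of the two points above, the dependency block of the command service could look something like the sketch below. This is not taken from the project: it assumes dependency versions are managed elsewhere (as in the lines quoted above), and the `spring-cloud-starter-consul-discovery` line is only an assumption about how plain Consul discovery for the web layer might be kept.

```kotlin
// build.gradle.kts of the command service (sketch, versions assumed to be managed by a BOM)
dependencies {
    // Axon Framework starter: connects to Axon Server, which routes
    // commands between the command-side instances
    implementation("org.axonframework:axon-spring-boot-starter")

    // Removed: a second command-routing mechanism on top of Axon Server
    // implementation("org.axonframework.extensions.springcloud:axon-springcloud-spring-boot-starter")

    // Assumption: plain Spring Cloud Consul discovery, kept only so the
    // gateway can locate the HTTP endpoints (not the command handlers)
    implementation("org.springframework.cloud:spring-cloud-starter-consul-discovery")
}
```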

This implies:

1. Removing springcloud:axon-springcloud-spring-boot-starter as a first step. You do not need two mechanisms (Consul, Axon Server) for service discovery and message/command routing.
2. Potentially extracting the web components (controllers) so they can be load balanced and discovered separately from the deep decision-making components such as aggregates (command handling components). Note that this falls back to just a regular Spring Boot configuration of Consul (no need for springcloud:axon-springcloud-spring-boot-starter on this level).
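
To make the idea of an extracted web component more concrete, here is a minimal sketch of a controller that only dispatches commands through the `CommandGateway` and leaves routing to Axon Server. The request path and the fields of `AddRoomCommand` are hypothetical and will differ from the real project; only the command name and its `Int` result come from the logs in the question.

```kotlin
import org.axonframework.commandhandling.gateway.CommandGateway
import org.axonframework.modelling.command.TargetAggregateIdentifier
import org.springframework.web.bind.annotation.PostMapping
import org.springframework.web.bind.annotation.RequestBody
import org.springframework.web.bind.annotation.RequestMapping
import org.springframework.web.bind.annotation.RestController
import java.util.concurrent.CompletableFuture

// Hypothetical command shape; in a real setup it would live in a shared API module.
data class AddRoomCommand(
    @TargetAggregateIdentifier val hotelId: String, // assumed routing key
    val roomNumber: Int                             // assumed payload field
)

// Web-only component: discoverable through Consul and load balanced by the
// gateway, but it contains no command handlers of its own.
@RestController
@RequestMapping("/rooms") // hypothetical path
class RoomCommandController(private val commandGateway: CommandGateway) {

    // Sends the command to Axon Server, which picks the instance that hosts
    // the matching command handler and returns its result asynchronously.
    @PostMapping
    fun addRoom(@RequestBody command: AddRoomCommand): CompletableFuture<Int> =
        commandGateway.send(command)
}
```

With this split, the instances that host the aggregates only need a connection to Axon Server and do not have to be registered in Consul at all.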

It would be nice to see the result of the same test without Gateway and Consul first. That way, you should be able to identify the bottleneck more easily.
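
A direct test of that kind could be sketched with Gatling's Kotlin DSL roughly as follows. The class name, the request path, and the JSON body are hypothetical; the port 8082 and the rate of 250 requests per second are taken from the question, and the host is assumed to be local.

```kotlin
import io.gatling.javaapi.core.CoreDsl.*
import io.gatling.javaapi.core.Simulation
import io.gatling.javaapi.http.HttpDsl.*

// Hypothetical simulation that targets one command-side instance directly,
// bypassing Spring Cloud Gateway and Consul.
class DirectCommandSimulation : Simulation() {

    // Assumption: the command service listens on port 8082 of the local machine.
    private val httpProtocol = http
        .baseUrl("http://localhost:8082")
        .contentTypeHeader("application/json")

    private val addRoom = scenario("Add room without gateway")
        .exec(
            http("add room")
                .post("/rooms") // hypothetical direct path
                .body(StringBody("""{"hotelId": "hotel-1", "roomNumber": 1}"""))
        )

    init {
        // Open workload model: 250 new users per second for 60 seconds.
        setUp(
            addRoom.injectOpen(constantUsersPerSec(250.0).during(60))
        ).protocols(httpProtocol)
    }
}
```

If this run holds up while the run through the gateway does not, the gateway and discovery layer is the more likely bottleneck; if it fails in the same way, the command side or its connection to Axon Server is the place to look.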