I've created an Axon application with two Spring Boot services, hotel-booking-command and hotel-booking-query, for the command side and the query side respectively. These services are loosely based on the sample application provided by AxonIQ. I'm using Axon Server as an event store and message router. The services are hidden behind Spring Cloud Gateway, and I'm using Consul as a discovery service. Everything seems to work fine as long as I use only one instance of the command side application. When I use two or more instances and the load gets higher, the connection to Axon Server is lost on all instances:
2022-06-26 17:00:37.675 INFO 86356 --- [ctor-http-nio-4] o.a.m.interceptors.LoggingInterceptor : [AddRoomCommand] executed successfully with a [Integer] return value
2022-06-26 17:00:37.675 INFO 86356 --- [ctor-http-nio-4] o.a.m.interceptors.LoggingInterceptor : Dispatched messages: [RoomAddedEvent]
2022-06-26 17:01:10.258 INFO 86356 --- [SQ-P024.local-0] i.a.a.c.impl.AxonServerManagedChannel : Unable to recover current connection to AxonServer. Attempting to reconnect...
2022-06-26 17:01:10.264 INFO 86356 --- [SQ-P024.local-0] i.a.a.c.impl.AxonServerManagedChannel : Requesting connection details from localhost:8124
2022-06-26 17:01:15.272 WARN 86356 --- [SQ-P024.local-0] i.a.a.c.impl.AxonServerManagedChannel : Connecting to AxonServer node [localhost:8124] failed.
io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 4.997735389s. [closed=[], open=[[buffered_nanos=4998488350, waiting_for_connection]]]
at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262) ~[grpc-stub-1.43.0.jar:1.43.0]
at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243) ~[grpc-stub-1.43.0.jar:1.43.0]
at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156) ~[grpc-stub-1.43.0.jar:1.43.0]
at io.axoniq.axonserver.grpc.control.PlatformServiceGrpc$PlatformServiceBlockingStub.getPlatformServer(PlatformServiceGrpc.java:250) ~[axonserver-connector-java-4.5.4.jar:4.5.4]
at io.axoniq.axonserver.connector.impl.AxonServerManagedChannel.connectChannel(AxonServerManagedChannel.java:115) ~[axonserver-connector-java-4.5.4.jar:4.5.4]
at io.axoniq.axonserver.connector.impl.AxonServerManagedChannel.createConnection(AxonServerManagedChannel.java:335) ~[axonserver-connector-java-4.5.4.jar:4.5.4]
at io.axoniq.axonserver.connector.impl.AxonServerManagedChannel.ensureConnected(AxonServerManagedChannel.java:308) ~[axonserver-connector-java-4.5.4.jar:4.5.4]
at io.axoniq.axonserver.connector.impl.AxonServerManagedChannel.lambda$scheduleConnectionCheck$4(AxonServerManagedChannel.java:378) ~[axonserver-connector-java-4.5.4.jar:4.5.4]
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[na:na]
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[na:na]
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[na:na]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[na:na]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[na:na]
at java.base/java.lang.Thread.run(Thread.java:833) ~[na:na]
2022-06-26 17:01:15.272 INFO 86356 --- [SQ-P024.local-0] i.a.a.c.impl.AxonServerManagedChannel : Failed to get connection to AxonServer. Scheduling a reconnect in 2000ms
2022-06-26 17:01:15.272 INFO 86356 --- [SQ-P024.local-1] i.a.a.c.impl.AxonServerManagedChannel : Connection to AxonServer lost. Attempting to reconnect...
2022-06-26 17:01:15.273 INFO 86356 --- [SQ-P024.local-1] i.a.a.c.impl.AxonServerManagedChannel : Requesting connection details from localhost:8124
2022-06-26 17:01:20.275 WARN 86356 --- [SQ-P024.local-1] i.a.a.c.impl.AxonServerManagedChannel : Connecting to AxonServer node [localhost:8124] failed: DEADLINE_EXCEEDED: deadline exceeded after 4.988964197s. [closed=[], open=[[buffered_nanos=4989266439, waiting_for_connection]]]
2022-06-26 17:01:20.275 INFO 86356 --- [SQ-P024.local-1] i.a.a.c.impl.AxonServerManagedChannel : Failed to get connection to AxonServer. Scheduling a reconnect in 2000ms
The logs from Gatling very quickly start to look like this (250 requests per second were executed, the code of the Gatling simulation is available here):
for HTTP POST "/api/hotel-booking/command/rooms"
io.netty.channel.AbstractChannel$AnnotatedConnectException: Operation timed out: /192.168.0.12:8082
Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:
Error has been observed at the following site(s):
*__checkpoint ⇢ org.springframework.cloud.gateway.filter.WeightCalculatorWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ org.springframework.boot.actuate.metrics.web.reactive.server.MetricsWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ HTTP POST "/api/hotel-booking/command/rooms" [ExceptionHandlingWebHandler]
Original Stack Trace:
Caused by: java.net.ConnectException: Operation timed out
at java.base/sun.nio.ch.Net.pollConnect(Native Method) ~[na:na]
at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[na:na]
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[na:na]
Usually, only the first 500 or 1000 requests are handled correctly.
The current version of the application (including load tests, located in the gatling module) is available here: https://github.com/a-glapinski/event-sourcing-and-cqrs-jvm/tree/api-testing. I'm willing to provide more details about the application if needed; right now I'm not sure what could be wrong or where to look.
Is there something missing in the configuration of my Axon command side application? Should I change some configuration in Axon Server? Or is the concept of the whole system I've created wrong in the context of Axon Framework, so that my application won't be able to scale at all?
Thanks for sharing your project. It is an amazing initiative!
I have a couple of concerns about this design on the strategic level:
implementation("org.axonframework:axon-spring-boot-starter") and implementation("org.axonframework.extensions.springcloud:axon-springcloud-spring-boot-starter") are used together to distribute commands. There is no need for this. Axon Server acts as a service registry and discovery mechanism for your commands, and it routes commands to the appropriate command handlers (much better than Consul). This implies removing springcloud:axon-springcloud-spring-boot-starter as a first step. You do not need two mechanisms (Consul, Axon Server) for service discovery and message/command routing.

(... springcloud:axon-springcloud-spring-boot-starter on this level). It would be nice to see the result of the same test without Gateway and Consul at first. In this case, you should be able to identify the bottleneck in a better way.
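
To illustrate the first step, here is a minimal sketch of what the dependencies block of the command service could look like afterwards. It assumes a Gradle Kotlin DSL build file (build.gradle.kts), which the quoted implementation(...) lines suggest, and that dependency versions are managed elsewhere in the build, since none appear in the lines above. All other dependencies of the module are omitted.

```kotlin
// build.gradle.kts of the command service (sketch; unrelated dependencies omitted)
dependencies {
    // Kept: the Axon Framework starter. Axon Server then handles command
    // routing between the running instances of the command service.
    implementation("org.axonframework:axon-spring-boot-starter")

    // Removed: the Spring Cloud extension, which sets up a second,
    // Consul-based routing mechanism for commands.
    // implementation("org.axonframework.extensions.springcloud:axon-springcloud-spring-boot-starter")
}
```

After that change, rerunning the same load test directly against the command service instances, without Gateway and Consul in between, should show whether the connection drops come from Axon Server itself or from the layers in front of it.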