Tags: docker, hazelcast, vert.x, high-availability

Vert.x high availability is not working


Brief description

I'm just getting started with Vert.x and wanted to try the high-availability feature with a little toy example. In my setup I have a fat-jar application which is deployed to several Docker containers. The application programmatically creates a Vert.x instance and starts one verticle called ContainerVerticle. This runs an HTTP server and acts as a "launcher": when a "SPAWN" command is received, it deploys another verticle called AppVerticle in high-availability mode. The idea is to run this on 3 containers and then kill the JVM on one of them; that should redeploy the AppVerticle to another Docker container.

Actual result: the verticles can talk to each other over the event bus, and the cluster seems to work correctly: according to the log file, the members see each other. However, when I kill one of the JVMs, the AppVerticle does not get redeployed.

More details

(All source code is written in Kotlin.) Vert.x initialization:

    val hzConfig = Config()
    val mgr = HazelcastClusterManager(hzConfig)  // empty config -> use default
    val hostAddress = getAddress() // get the local ip address (not localhost!)

    val options = VertxOptions()
            .setClustered(true)
            .setClusterHost(hostAddress)
            .setClusterPort(18001)
            .setClusterManager(mgr)
            //.setQuorumSize(2)
            .setHAEnabled(true)

    val eventBusOptions = EventBusOptions()
    eventBusOptions
            .setClustered(true)
            .setHost(hostAddress)
            .setPort(18002)
    options.setEventBusOptions(eventBusOptions)

    Vertx.clusteredVertx(options) { res ->
        if (res.succeeded()) {
            val vertx = res.result()
            vertx.deployVerticle(ContainerVerticle::class.java.name,
                    DeploymentOptions()
                            .setHa(false)) // ContainerVerticle should not restart
        }
    }

ContainerVerticle (our 'launcher')

class ContainerVerticle : AbstractVerticle() {
    ...
    override fun start(startFuture: Future<Void>?) {

        val router = createRouter()
        val port = config().getInteger("http.port", 8080)

        vertx.eventBus().consumer<Any>("mynamspace.container.spawn") { message ->
            val appVerticleID = message.body()
            log.info(" - HANDLE SPAWN message \"${appVerticleID}\"")
            val appVerticleConfig = JsonObject().put("ID", appVerticleID)

            vertx.deployVerticle(AppVerticle::class.java.name, // Deploy the APP!!!
                    DeploymentOptions()
                            .setConfig(appVerticleConfig)
                            .setInstances(1)
                            .setHa(true))
        }

        vertx.createHttpServer()...  // omitted (see github link)
    }

    private fun createRouter(): Router { ... } // omitted (see github link)

    val handlerRoot = Handler<RoutingContext> { routingContext ->
        val cmd = routingContext.bodyAsString
        val tokens = cmd.split(" ")
        if (tokens[0] == "spawn") {
            vertx.eventBus().send("mynamspace.container.spawn", tokens[1])  // round-robin
            routingContext.response().end("Successfully handled command ${cmd}\n")
        } else if (tokens[0] == "send") {
            vertx.eventBus().send("mynamspace.app.${tokens[1]}", tokens[2])
            routingContext.response().end("success\n")
        } else {
            routingContext.response().end("ERROR: Unknown command ${cmd}\n")
        }
    }
}

The last part: the AppVerticle:

class AppVerticle : AbstractVerticle() {
    var timerID = 0L
    override fun start(startFuture: Future<Void>?) {
        val id = config().getString("ID")
        log.info(" SPAWNED app verticle \"${id}\"")

        vertx.eventBus().consumer<Any>("mynamspace.app.${id}") { message ->
            val cmd = message.body()
            log.info(" - app verticle \"${id}\" handled message ${cmd}")
        }

        timerID = vertx.setPeriodic(1000) {
            log.info(" - app verticle \"${id}\" is alive")
        }
    }
}

Running

Open 3 terminals and run 3 Docker instances. Minor details: here we re-map port 8080 to three different host ports 8081, 8082, 8083, and we also give unique names to the containers: cont1, cont2, cont3.

docker run --name "cont1" -it --rm -p 8081:8080 -v $PWD/build/libs:/app anapsix/alpine-java java -jar /app/vertxhaeval-1.0-SNAPSHOT-all.jar

docker run --name "cont2" -it --rm -p 8082:8080 -v $PWD/build/libs:/app anapsix/alpine-java java -jar /app/vertxhaeval-1.0-SNAPSHOT-all.jar

docker run --name "cont3" -it --rm -p 8083:8080 -v $PWD/build/libs:/app anapsix/alpine-java java -jar /app/vertxhaeval-1.0-SNAPSHOT-all.jar

Observation 1

The cluster members seem to see each other, judging by the following log message:

Members [3] {
    Member [172.17.0.2]:5701 - 1d50394c-cf11-4bd7-877e-7e06e2959940 this
    Member [172.17.0.3]:5701 - 3fa2cff4-ba9e-431b-9c4e-7b1fd8de9437
    Member [172.17.0.4]:5701 - b9a3114a-7c15-4992-b609-63c0f22ed388
}

We can also spawn the AppVerticle:

curl -d "spawn -={Application-1}=-" -XPOST http://localhost:8083

The event bus seems to work correctly, because we can see that the spawn message gets delivered to the ContainerVerticles in round-robin fashion.
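Based on the handlerRoot shown above, a spawned verticle can then also be messaged through the same HTTP endpoint. A hedged example (assuming the containers from the question are running and port 8083 is mapped as above; the ID string is arbitrary):

```shell
# Spawn an app verticle with the ID -={Application-1}=-
curl -d "spawn -={Application-1}=-" -XPOST http://localhost:8083

# Send it a message: the "send" branch of handlerRoot splits the body on
# spaces and forwards tokens[2] to the address mynamspace.app.<tokens[1]>,
# so the consumer registered by that AppVerticle should log "hello".
curl -d "send -={Application-1}=- hello" -XPOST http://localhost:8083
```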

Observation 2 - the problem

Now let's try to kill one of the JVMs (assuming the AppVerticle runs in cont2):

docker kill --signal=SIGKILL  cont2

The remaining containers seem to react to that event; the log file contains something like this:

Aug 14, 2018 8:18:45 AM com.hazelcast.internal.cluster.ClusterService
INFO: [172.17.0.4]:5701 [dev] [3.8.2] Removing Member [172.17.0.2]:5701 - fbe67a02-80a3-4207-aa10-110fc09e0607
Aug 14, 2018 8:18:45 AM com.hazelcast.internal.cluster.ClusterService
INFO: [172.17.0.4]:5701 [dev] [3.8.2] 

Members [2] {
    Member [172.17.0.3]:5701 - 8b93a822-aa7f-460d-aa3e-568e0d85067c
    Member [172.17.0.4]:5701 - b0ecea8e-59f1-440c-82ca-45a086842004 this
}

However, the AppVerticle does NOT get redeployed.

The full source code is available on github: https://github.com/conceptacid/vertx-ha-eval


Solution

  • I spent several hours debugging this, but finally found it.

    So here is the solution:

    Your verticle's start method signature is:

    override fun start(startFuture: Future<Void>?)

    You're overriding the variant of start that hands you a future, which Vert.x waits on after the verticle has started. Vert.x waits forever for this future to complete, because you never call

    startFuture.complete()

    at the end of the method.

    So the verticle is never added to the verticle list of the HAManager and will therefore not be redeployed.
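    A minimal sketch of the corrected AppVerticle.start (the same code as in the question, only completing the future at the end):

    ```kotlin
    override fun start(startFuture: Future<Void>?) {
        val id = config().getString("ID")
        log.info(" SPAWNED app verticle \"${id}\"")

        vertx.eventBus().consumer<Any>("mynamspace.app.${id}") { message ->
            log.info(" - app verticle \"${id}\" handled message ${message.body()}")
        }

        timerID = vertx.setPeriodic(1000) {
            log.info(" - app verticle \"${id}\" is alive")
        }

        // Signal that start-up is finished. Without this call Vert.x keeps
        // waiting, the HAManager never registers the verticle, and fail-over
        // cannot happen.
        startFuture?.complete()
    }
    ```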

    Alternatively, you can use

    override fun start()

    as the method signature if your verticle does a simple, synchronous start-up.

    Hope this helps.