Neo4j embedded : Upgrade from 2.3.9 to 3.2.3 : initial_hosts are not communicating with each other


I upgraded my embedded Neo4j database from 2.3.9 to 3.2.3 in SINGLE mode, and the upgrade succeeded. After the upgrade, I enabled HA mode. When I run Neo4j as a cluster of three instances, I hit the issue below.

Each server runs fine in HA mode on its own (i.e. with ha.initial_hosts = "ip_address_1:5101"), but if I list all three servers under initial_hosts (as shown in the config), all three servers stop immediately.

Am I missing any configuration? Please suggest.

Config:

neo4j {
            # Enable these two options while upgrading neo4j database.
            # dbms.allow_format_migration=true

            # or weak or strong
            cache_type = "weak"
            # Reduce the default page cache memory allocation
            dbms.memory.pagecache.size="6G"

            # Port to listen to for incoming backup requests.
            dbms.backup.address = ${local.private-ip}":6367"

            # Unique server id for this Neo4j instance
            # can not be negative id and must be unique
            ha.server_id="1"

            # List of other known instances in this cluster
            ha.initial_hosts = "ip_1:5101,ip_2:5101,ip_3:5101"

            # ha.initial_hosts = "ip_1:5101"
            # ha.cluster_server = ${local.private-ip}":5101"

            # IP and port for this instance to bind to for communicating cluster information
            # with the other neo4j instances in the cluster.
            ha.host.coordination = ${local.private-ip}":5101"

            # IP and port for this instance to bind to for communicating data with the
            # other neo4j instances in the cluster.
            ha.host.data = ${local.private-ip}":6365"

            # HA - High Availability
            # SINGLE - Single mode, default.
            dbms.mode="HA"

            # HTTP Connector
            dbms.connector.http.enabled="true"
            dbms.connector.http.listen_address=":7474"

            # Bolt connector
            dbms.connector.bolt.enabled="true"
            dbms.connector.bolt.tls_level="OPTIONAL"
            dbms.connector.bolt.listen_address=":7689"
}

From the neo4j debug.log:

2017-10-09 12:35:47.153+0000 ERROR [o.n.k.h.c.m.HighAvailabilityModeSwitcher] Error while trying to switch to slave Cannot find the master among [] with master serverId=1 and uri=ha://ip_address_1:6365?serverId=1
    java.lang.IllegalStateException: Cannot find the master among [] with master serverId=1 and uri=ha://ip_address_1:6365?serverId=1
            at org.neo4j.kernel.ha.cluster.SwitchToSlave.checkMyStoreIdAndMastersStoreId(SwitchToSlave.java:263)
            at org.neo4j.kernel.ha.cluster.SwitchToSlaveBranchThenCopy.checkDataConsistency(SwitchToSlaveBranchThenCopy.java:142)
            at org.neo4j.kernel.ha.cluster.SwitchToSlave.executeConsistencyChecks(SwitchToSlave.java:478)
            at org.neo4j.kernel.ha.cluster.SwitchToSlave.switchToSlave(SwitchToSlave.java:221)
            at org.neo4j.kernel.ha.cluster.modeswitch.HighAvailabilityModeSwitcher$1.run(HighAvailabilityModeSwitcher.java:355)
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
            at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:109)
    2017-10-09 12:35:47.154+0000 INFO [o.n.k.h.c.m.HighAvailabilityModeSwitcher] Attempting to switch to slave in 300s
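
The empty member list in the error ("among []") is consistent with the instances not seeing each other, so one thing worth ruling out is plain network reachability of the coordination port. The following is a minimal sketch, not part of Neo4j: the class and method names are hypothetical, and it only checks that a TCP connection can be opened to each entry of ha.initial_hosts. A peer must already be running for its port to accept connections, and a successful connect says nothing about the cluster protocol itself.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Neo4j API: checks TCP reachability of ha.initial_hosts entries.
class InitialHostsCheck {

    /** Splits a comma-separated "host:port" list, the format used by ha.initial_hosts. */
    static List<InetSocketAddress> parse(String initialHosts) {
        List<InetSocketAddress> result = new ArrayList<>();
        for (String entry : initialHosts.split(",")) {
            String trimmed = entry.trim();
            int colon = trimmed.lastIndexOf(':');
            if (colon <= 0 || colon == trimmed.length() - 1) {
                throw new IllegalArgumentException("Expected host:port but got: " + trimmed);
            }
            String host = trimmed.substring(0, colon);
            int port = Integer.parseInt(trimmed.substring(colon + 1));
            // Keep the address unresolved here; DNS lookup happens when connecting.
            result.add(InetSocketAddress.createUnresolved(host, port));
        }
        return result;
    }

    /** Returns true if a TCP connection to the address succeeds within the timeout. */
    static boolean isReachable(InetSocketAddress address, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            // Build a fresh address so the host name is resolved at connect time.
            socket.connect(
                    new InetSocketAddress(address.getHostString(), address.getPort()),
                    timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts, and unknown hosts.
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder host names from the question; pass the real list as the first argument.
        String initialHosts = args.length > 0 ? args[0] : "ip_1:5101,ip_2:5101,ip_3:5101";
        for (InetSocketAddress address : parse(initialHosts)) {
            System.out.println(address.getHostString() + ":" + address.getPort()
                    + " reachable=" + isReachable(address, 2000));
        }
    }
}
```

Running this from each machine against the other two, while at least one instance is up in single-host HA mode, would show whether a firewall or bind address is blocking port 5101.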

Solution

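  • The default value of ha.join_timeout is 30 seconds, which can be too short for all three instances to find each other while the cluster is forming. Increase it, for example to 10 minutes as shown below. The documentation describes the setting as follows: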

    Timeout for joining a cluster. Defaults to ha.broadcast_timeout. Note that if the timeout expires during cluster formation, the operator may have to restart the instance or instances.

    ha.join_timeout=10m

    https://neo4j.com/docs/operations-manual/current/reference/configuration-settings/#config_ha.join_timeout
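
For an embedded setup, the same setting has to reach the database as a key/value pair alongside the other HA options. The sketch below only assembles those pairs using the keys and ports from the question's config; the class and method names are hypothetical, and how the map is handed to the embedded database builder is an assumption to verify against the Neo4j version in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects the HA settings from the question plus ha.join_timeout.
class HaSettings {

    static Map<String, String> build(String serverId, String privateIp,
                                     String initialHosts, String joinTimeout) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("dbms.mode", "HA");
        // Must be unique per instance and non-negative.
        settings.put("ha.server_id", serverId);
        // Same list on every instance.
        settings.put("ha.initial_hosts", initialHosts);
        // Ports taken from the question's config.
        settings.put("ha.host.coordination", privateIp + ":5101");
        settings.put("ha.host.data", privateIp + ":6365");
        // The fix from the answer: allow more time for joining the cluster.
        settings.put("ha.join_timeout", joinTimeout);
        return settings;
    }

    public static void main(String[] args) {
        // Placeholder values; replace with the real private IP and host list.
        Map<String, String> settings =
                build("1", "ip_1", "ip_1:5101,ip_2:5101,ip_3:5101", "10m");
        // Assumption: in the embedded application these pairs would be passed to the
        // database builder's configuration before the database is started.
        settings.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Only ha.server_id and the private IP differ between the three instances; ha.initial_hosts and ha.join_timeout stay identical on all of them.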