Elasticsearch 7 on Openstack instances unable to setup ES cluster

I'm trying to set up an Elasticsearch cluster on Openstack. I've got the two Openstack instances each running ES and the instances are able to ping each other and curl eachothers ES instance. But no matter how I configure the Elasticsearch.yml files I can't seem to get them to form a cluster.

I am using Elasticsearch 7.3.2 on both instances. I am using non-floating IPs in the below configs.

Instance 1 - Elasticsearch.yml

cluster.name: my-cluster
node.name: node-1
node.master: true
node.data: true
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: [_local_,_site_]
http.port: 9200
discovery.seed_hosts: ["<INSTANCE1-IP>:9300", "<INSTANCE2-IP>:9300"]
cluster.initial_master_nodes: ["<INSTANCE2-IP>:9300"]

Instance 2 - Elasticsearch.yml

cluster.name: my-cluster
node.name: node-2
node.master: false
node.data: true
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: [_local_,_site_]
http.port: 9200
discovery.seed_hosts: ["<INSTANCE1-IP>:9300", "<INSTANCE2-IP>:9300"]
cluster.initial_master_nodes: ["<INSTANCE2-IP>:9300"]

Using those configs, the master node (Instance1) loads up fine but when checking the health of the second node (Instance2) I get a master_not_discovered_exception (503). Any ideas?

Checking the logs on node-2 displays the following information:

[2019-10-01T09:45:53,126][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [node-2] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
[2019-10-01T09:45:53,127][WARN ][r.suppressed             ] [node-2] path: /_cluster/health, params: {pretty=}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$3.onTimeout(TransportMasterNodeAction.java:251) [elasticsearch-7.3.2.jar:7.3.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:325) [elasticsearch-7.3.2.jar:7.3.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) [elasticsearch-7.3.2.jar:7.3.2]
    at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:572) [elasticsearch-7.3.2.jar:7.3.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.3.2.jar:7.3.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:835) [?:?]
[2019-10-01T09:45:59,754][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node-2] master not discovered yet: have discovered [{node-2}{Sb2fOZEKR42_4sB2XnmShg}{LgJ0DLojSay7KV2_cXgdpw}{<INSTANCE2-IP>}{<INSTANCE2-IP>:9300}{di}{ml.machine_memory=33728778240, xpack.installed=true, ml.max_open_jobs=20}, {node-1}{HqONwd3fQHSWZxxEtctcog}{cu5KW146S8-04oBBCqL3QA}{<INSTANCE1-IP>}{<INSTANCE1-IP>:9300}{dim}{ml.machine_memory=33728778240, ml.max_open_jobs=20, xpack.installed=true}]; discovery will continue using [<INSTANCE2-IP>:9300] from hosts providers and [] from last-known cluster state; node term 0, last-accepted version 0 in term 0

Solution

I managed to solve it thanks to this thread. It was a mixture of changing some config and performing a full ES restart by deleting the node data.

Steps I did:

1) Access the logs to identify the error:

sudo -i
cd /var/log/elasticsearch
cat my-cluster.log

[2019-10-01T09:45:53,126][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [node-2] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
[2019-10-01T09:45:53,127][WARN ][r.suppressed             ] [node-2] path: /_cluster/health, params: {pretty=}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$3.onTimeout(TransportMasterNodeAction.java:251) [elasticsearch-7.3.2.jar:7.3.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:325) [elasticsearch-7.3.2.jar:7.3.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) [elasticsearch-7.3.2.jar:7.3.2]
    at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:572) [elasticsearch-7.3.2.jar:7.3.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.3.2.jar:7.3.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:835) [?:?]
[2019-10-01T09:45:59,754][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node-2] master not discovered yet: have discovered [{node-2}{Sb2fOZEKR42_4sB2XnmShg}{LgJ0DLojSay7KV2_cXgdpw}{<INSTANCE2-IP>}{<INSTANCE2-IP>:9300}{di}{ml.machine_memory=33728778240, xpack.installed=true, ml.max_open_jobs=20}, {node-1}{HqONwd3fQHSWZxxEtctcog}{cu5KW146S8-04oBBCqL3QA}{<INSTANCE1-IP>}{<INSTANCE1-IP>:9300}{dim}{ml.machine_memory=33728778240, ml.max_open_jobs=20, xpack.installed=true}]; discovery will continue using [<INSTANCE2-IP>:9300] from hosts providers and [] from last-known cluster state; node term 0, last-accepted version 0 in term 0

2) Change cluster.initial_master_nodes to use node names instead of node IPs:

cluster.initial_master_nodes: ["node-1"]

3) Delete the existing node data and restart Elasticsearch on both nodes

sudo rm -rf /var/lib/elasticsearch
sudo service elasticsearch restart