elasticsearch network-programming rackspace

Elasticsearch master node constantly connecting and disconnecting

I'm constantly getting these error messages in my logs:

[2015-11-10 13:52:03,037][WARN ][discovery.zen.ping.unicast] [ClusterUK Node 1] [11] failed send ping to [ClusterUK Node 1][x-eBYFoiRemOBK7egMHTRg][elasticuk1][inet[/172.24.32.10:9300]]{master=true}
org.elasticsearch.ElasticsearchIllegalStateException: can't add nodes to a stopped transport
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:746)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:731)
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:216)
    at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:376)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
[2015-11-10 13:52:03,038][WARN ][discovery.zen.ping.unicast] [ClusterUK Node 1] [12] failed send ping to [ClusterUK Node 1][x-eBYFoiRemOBK7egMHTRg][elasticuk1][inet[/172.24.32.10:9300]]{master=true}
org.elasticsearch.ElasticsearchIllegalStateException: can't add nodes to a stopped transport
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:746)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:731)
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:216)
    at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:376)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
[2015-11-10 13:52:03,038][WARN ][discovery.zen.ping.unicast] [ClusterUK Node 1] [12] failed send ping to [ClusterUK Node 1][x-eBYFoiRemOBK7egMHTRg][elasticuk1][inet[/172.24.32.10:9300]]{master=true}
org.elasticsearch.ElasticsearchIllegalStateException: can't add nodes to a stopped transport
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:746)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:731)
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:216)
    at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:376)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
[2015-11-10 13:52:11,378][INFO ][transport                ] [ClusterUK Node 1] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/172.24.32.10:9300]}
[2015-11-10 13:52:11,394][INFO ][discovery                ] [ClusterUK Node 1] ClusterUK/FTiLxRmZQLyFtyap8JTj2w
[2015-11-10 13:52:14,498][INFO ][cluster.service          ] [ClusterUK Node 1] detected_master [ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true}, added {[ClusterUK Client Node STG1][_JfbrXjFTzGD7BL7OTqbVA][Staging1][inet[/192.168.100.248:9300]]{data=false, master=false},[ClusterUK Node 3][rHJ486YyQHqKytG44fmC7g][elasticuk3][inet[/172.24.32.8:9300]]{master=true},[ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true},}, reason: zen-disco-receive(from master [[ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true}])
[2015-11-10 13:52:14,749][INFO ][http                     ] [ClusterUK Node 1] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/172.24.32.10:9200]}
[2015-11-10 13:52:14,750][INFO ][node                     ] [ClusterUK Node 1] started
[2015-11-10 13:52:44,994][INFO ][discovery.zen            ] [ClusterUK Node 1] master_left [[ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true}], reason [do not exists on master, act as master failure]
[2015-11-10 13:52:44,996][WARN ][discovery.zen            ] [ClusterUK Node 1] master left (reason = do not exists on master, act as master failure), current nodes: {[ClusterUK Client Node STG1][_JfbrXjFTzGD7BL7OTqbVA][Staging1][inet[/192.168.100.248:9300]]{data=false, master=false},[ClusterUK Node 1][FTiLxRmZQLyFtyap8JTj2w][elasticuk1][inet[elasticuk1/172.24.32.10:9300]]{master=true},[ClusterUK Node 3][rHJ486YyQHqKytG44fmC7g][elasticuk3][inet[/172.24.32.8:9300]]{master=true},}
[2015-11-10 13:52:44,996][INFO ][cluster.service          ] [ClusterUK Node 1] removed {[ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true},}, reason: zen-disco-master_failed ([ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true})
[2015-11-10 13:52:48,047][INFO ][cluster.service          ] [ClusterUK Node 1] detected_master [ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true}, added {[ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true},}, reason: zen-disco-receive(from master [[ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true}])
[2015-11-10 13:53:10,689][INFO ][cluster.service          ] [ClusterUK Node 1] removed {[ClusterUK Node 3][rHJ486YyQHqKytG44fmC7g][elasticuk3][inet[/172.24.32.8:9300]]{master=true},}, reason: zen-disco-receive(from master [[ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true}])
[2015-11-10 13:53:13,199][INFO ][cluster.service          ] [ClusterUK Node 1] added {[ClusterUK Node 3][rHJ486YyQHqKytG44fmC7g][elasticuk3][inet[/172.24.32.8:9300]]{master=true},}, reason: zen-disco-receive(from master [[ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true}])
[2015-11-10 13:53:35,963][INFO ][discovery.zen            ] [ClusterUK Node 1] master_left [[ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true}], reason [transport disconnected]
[2015-11-10 13:53:35,964][WARN ][discovery.zen            ] [ClusterUK Node 1] master left (reason = transport disconnected), current nodes: {[ClusterUK Client Node STG1][_JfbrXjFTzGD7BL7OTqbVA][Staging1][inet[/192.168.100.248:9300]]{data=false, master=false},[ClusterUK Node 1][FTiLxRmZQLyFtyap8JTj2w][elasticuk1][inet[elasticuk1/172.24.32.10:9300]]{master=true},[ClusterUK Node 3][rHJ486YyQHqKytG44fmC7g][elasticuk3][inet[/172.24.32.8:9300]]{master=true},}
[2015-11-10 13:53:35,965][INFO ][cluster.service          ] [ClusterUK Node 1] removed {[ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true},}, reason: zen-disco-master_failed ([ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true})
[2015-11-10 13:53:39,018][INFO ][cluster.service          ] [ClusterUK Node 1] detected_master [ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true}, added {[ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true},}, reason: zen-disco-receive(from master [[ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true}])
[2015-11-10 13:54:03,581][INFO ][discovery.zen            ] [ClusterUK Node 1] master_left [[ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true}], reason [transport disconnected]
[2015-11-10 13:54:03,581][WARN ][discovery.zen            ] [ClusterUK Node 1] master left (reason = transport disconnected), current nodes: {[ClusterUK Client Node STG1][_JfbrXjFTzGD7BL7OTqbVA][Staging1][inet[/192.168.100.248:9300]]{data=false, master=false},[ClusterUK Node 1][FTiLxRmZQLyFtyap8JTj2w][elasticuk1][inet[elasticuk1/172.24.32.10:9300]]{master=true},[ClusterUK Node 3][rHJ486YyQHqKytG44fmC7g][elasticuk3][inet[/172.24.32.8:9300]]{master=true},}
[2015-11-10 13:54:03,581][INFO ][cluster.service          ] [ClusterUK Node 1] removed {[ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true},}, reason: zen-disco-master_failed ([ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true})
[2015-11-10 13:54:06,603][INFO ][cluster.service          ] [ClusterUK Node 1] detected_master [ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true}, added {[ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true},}, reason: zen-disco-receive(from master [[ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true}])
[2015-11-10 13:54:39,790][INFO ][discovery.zen            ] [ClusterUK Node 1] master_left [[ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true}], reason [transport disconnected]
[2015-11-10 13:54:39,792][WARN ][discovery.zen            ] [ClusterUK Node 1] master left (reason = transport disconnected), current nodes: {[ClusterUK Client Node STG1][_JfbrXjFTzGD7BL7OTqbVA][Staging1][inet[/192.168.100.248:9300]]{data=false, master=false},[ClusterUK Node 1][FTiLxRmZQLyFtyap8JTj2w][elasticuk1][inet[elasticuk1/172.24.32.10:9300]]{master=true},[ClusterUK Node 3][rHJ486YyQHqKytG44fmC7g][elasticuk3][inet[/172.24.32.8:9300]]{master=true},}
[2015-11-10 13:54:39,792][INFO ][cluster.service          ] [ClusterUK Node 1] removed {[ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true},}, reason: zen-disco-master_failed ([ClusterUK Node 2][T5R_1SUwRu6Q4zZLMTbNlA][elasticuk2][inet[/172.24.32.5:9300]]{master=true})
[2015-11-10 13:54:42,366][ERROR][marvel.agent.exporter    ] [ClusterUK Node 1] remote target didn't respond with 200 OK response code [503 Service Unavailable]. content: [error: ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]]

Here is my elasticsearch.yml file:

action.disable_delete_all_indices: true

cluster.name: ClusterUK

network.publish_host: "172.24.32.10"

discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["172.24.32.10", "172.24.32.5", "172.24.32.8"]

indices.fielddata.cache.size: 25%
indices.cluster.send_refresh_mapping: false

node.name: "ClusterUK Node 1" 
node.master: true
node.data: true

bootstrap.mlockall: true

In some cases this leaves the Elasticsearch service not running for a few seconds.

This is currently running on Rackspace, and I think network issues might be involved (although I'm binding to a specific IP address and using unicast).

There are 4 nodes running there (3 with master=true and data=true and one client node).

Can someone give me some insight into what's actually happening here? The version is 1.7.3 (1.7.1 on the client node), running on Windows Server.

I suspect the issue comes from "master left (reason = transport disconnected)" and that it's a split-brain, but how do I fix it?
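On the split-brain suspicion: the commonly recommended value for discovery.zen.minimum_master_nodes is a majority of the master-eligible nodes, that is (N / 2) + 1 with integer division. Here is a minimal sketch of that arithmetic for the three master-eligible nodes in this cluster. The function name is only illustrative and not part of any Elasticsearch API.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Majority (quorum) of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# Three master-eligible nodes, as in the cluster described above.
quorum = minimum_master_nodes(3)
print(quorum)
```

For three nodes this gives 2, which matches the value in the elasticsearch.yml above, so the quorum setting itself looks consistent.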

Solution

I found the cause of the issue: in this environment, Elasticsearch doesn't tolerate TCP offloading.

A TCP offload engine is a feature of network interface cards (NICs) that moves processing of the TCP/IP stack from the operating system to the network controller. By moving some or all of that processing to dedicated hardware, it frees the system's main CPU for other tasks. However, TCP offloading is known to cause connectivity problems in some setups, and disabling it can avoid them.

Disable TCP Offloading

On the Windows server, open the Control Panel and select Network Settings > Change Adapter Settings.

Right-click on each of the adapters (private and public), select Configure from the Networking menu, and then click the Advanced tab. The TCP offload settings are listed for the Citrix adapter.

Disable each of the following TCP offload options, and then click OK:
- IPv4 Checksum Offload
- Large Receive Offload
- Large Send Offload
- TCP Checksum Offload

This solved my issue.
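To check whether the disconnects really stop after the change, one option is to tally the "master left" warnings in the Elasticsearch log before and after. The following is a rough sketch, assuming the 1.x log line format shown in the question. The function name and the abbreviated sample lines (node lists cut down to {}) are illustrative only.

```python
import re
from collections import Counter

# Matches WARN lines from the discovery.zen logger of the form seen above:
#   ... master left (reason = transport disconnected), current nodes: ...
MASTER_LEFT = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[WARN\s*\]\[discovery\.zen\s*\].*?"
    r"master left \(reason = (?P<reason>[^)]*)\)"
)


def tally_master_left(lines):
    """Count 'master left' warnings per reason in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = MASTER_LEFT.match(line)
        if match:
            counts[match.group("reason")] += 1
    return counts


# Lines taken from the log in the question, with the node lists abbreviated.
SAMPLE = [
    "[2015-11-10 13:52:44,996][WARN ][discovery.zen            ] [ClusterUK Node 1] "
    "master left (reason = do not exists on master, act as master failure), current nodes: {}",
    "[2015-11-10 13:53:35,964][WARN ][discovery.zen            ] [ClusterUK Node 1] "
    "master left (reason = transport disconnected), current nodes: {}",
    "[2015-11-10 13:54:03,581][WARN ][discovery.zen            ] [ClusterUK Node 1] "
    "master left (reason = transport disconnected), current nodes: {}",
    "[2015-11-10 13:52:14,750][INFO ][node                     ] [ClusterUK Node 1] started",
]

for reason, count in tally_master_left(SAMPLE).most_common():
    print(f"{count:3d}  {reason}")
```

For a real check, pass an open log file instead of SAMPLE; an empty tally over a comparable time window after the change would suggest the fix is holding.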