Ignite TcpDiscoveryMulticastIpFinder does not work: Node FAILED, the Apache Ignite servers cannot form a cluster


The example configuration at https://github.com/apache/ignite/blob/master/examples/config/example-default.xml uses TcpDiscoveryMulticastIpFinder but does not configure a multicast group:

                <!--<bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">-->
                <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.multicast.TcpDiscoveryMulticastIpFinder">
                    <property name="addresses">
                        <list>
                            <!-- In distributed environment, replace with actual host IP address. -->
                            <value>127.0.0.1:47500..47509</value>
                        </list>
                    </property>
                </bean>

But the official documentation, https://apacheignite.readme.io/docs/cluster-config#section-multicast-based-discovery, configures it with a multicast group:

<bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
  <property name="ipFinder">
    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.multicast.TcpDiscoveryMulticastIpFinder">
      <property name="multicastGroup" value="228.10.10.157"/>
    </bean>
  </property>
</bean>

So my question is: since the example does not specify the multicastGroup property, does it use a default one? Or should I configure multicastGroup explicitly? I checked my lab environment; should I use 228.1.2.4 as the multicast group address?

# ip link show em1 | grep MULTICAST
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT qlen 1000

# ip maddress show
1:  lo
    inet  224.0.0.1
    inet6 ff02::1
    inet6 ff01::1
2:  em1
    link  01:00:5e:00:00:01
    link  33:33:00:00:00:01
    link  33:33:ff:e6:07:a8
    link  01:00:5e:01:02:04
    inet  228.1.2.4
    inet  224.0.0.1
    inet6 ff02::1:ffe6:7a8
    inet6 ff02::1
    inet6 ff01::1

In my environment I have 3 server nodes, but they cannot form a cluster; the topology always shows failed nodes:

[10:59:34,424][WARNING][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Timed out waiting for message delivery receipt (most probably, the reason is in long GC pauses on remote node; consider tuning GC and increasing 'ackTimeout' configuration property). Will retry to send message with increased timeout [currentTimeout=10000, rmtAddr=/192.168.28.162:47500, rmtPort=47500]
[11:00:02,334][WARNING][disco-event-worker-#101][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=ca28bc89-8455-49dd-9e3a-bc4e22581125, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.28.163], sockAddrs=[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.28.163:47500], discPort=47500, order=20, intOrder=13, lastExchangeTime=1525186722970, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false]
[11:00:41,674][WARNING][disco-event-worker-#101][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=42a3f2ef-4aa7-49d1-9987-05807efb4d46, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.28.184], sockAddrs=[/192.168.28.184:0, /0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0], discPort=0, order=25, intOrder=15, lastExchangeTime=1525186727940, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=true]

There is no traffic, and CPU and memory usage is very low. The cluster worked for a while after the first start and then failed.

====================

I stopped all the nodes and tried again; it still fails.

I started one server node and it worked, then started the second and the third. In the log I could see the topology grow to 3 nodes, but it quickly failed and dropped back to 1 server. All 3 nodes ended up in a topology of 1 node:

[11:57:32,585][INFO][main][GridDiscoveryManager] Topology snapshot [ver=1, servers=1, clients=0, CPUs=32, offheap=25.0GB, heap=1.0GB]
[11:57:32,585][INFO][main][GridDiscoveryManager] Data Regions Configured:
[11:57:32,585][INFO][main][GridDiscoveryManager]   ^-- default [initSize=256.0 MiB, maxSize=25.1 GiB, persistenceEnabled=true]
[11:57:59,523][INFO][ignite-update-notifier-timer][GridUpdateNotifier] Your version is up to date.
[11:58:32,586][INFO][grid-timeout-worker-#71][IgniteKernal] 
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=4769f8fa, uptime=00:01:00.008]
    ^-- H/N/C [hosts=1, nodes=1, CPUs=32]
    ^-- CPU [cur=0.03%, avg=0.15%, GC=0%]
    ^-- PageMemory [pages=0]
    ^-- Heap [used=99MB, free=89.83%, comm=981MB]
    ^-- Non heap [used=50MB, free=96.7%, comm=50MB]
    ^-- Outbound messages queue [size=0]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=6, qSize=0]
[11:59:03,122][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/192.168.28.162, rmtPort=51705]
[11:59:03,135][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/192.168.28.162, rmtPort=51705]
[11:59:03,136][INFO][tcp-disco-sock-reader-#6][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/192.168.28.162:51705, rmtPort=51705]
[11:59:08,174][INFO][tcp-disco-sock-reader-#6][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/192.168.28.162:51705, rmtPort=51705
[11:59:14,391][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/192.168.28.162, rmtPort=60747]
[11:59:14,391][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/192.168.28.162, rmtPort=60747]
[11:59:14,392][INFO][tcp-disco-sock-reader-#7][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/192.168.28.162:60747, rmtPort=60747]
[11:59:14,399][INFO][tcp-disco-sock-reader-#7][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/192.168.28.162:60747, rmtPort=60747
[11:59:18,428][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/192.168.28.162, rmtPort=48386]
[11:59:18,428][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/192.168.28.162, rmtPort=48386]
[11:59:18,428][INFO][tcp-disco-sock-reader-#8][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/192.168.28.162:48386, rmtPort=48386]
[11:59:18,452][INFO][disco-event-worker-#101][GridDiscoveryManager] Added new node to topology: TcpDiscoveryNode [id=8c87d53c-ba5e-4bdc-800c-0a51f391fc38, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.28.162], sockAddrs=[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.28.162:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1525190343144, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false]
[11:59:18,453][INFO][disco-event-worker-#101][GridDiscoveryManager] Topology snapshot [ver=2, servers=2, clients=0, CPUs=64, offheap=50.0GB, heap=2.0GB]
[11:59:18,453][INFO][disco-event-worker-#101][GridDiscoveryManager] Data Regions Configured:
[11:59:18,454][INFO][disco-event-worker-#101][GridDiscoveryManager]   ^-- default [initSize=256.0 MiB, maxSize=25.1 GiB, persistenceEnabled=true]
[11:59:32,589][INFO][grid-timeout-worker-#71][IgniteKernal] 
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=4769f8fa, uptime=00:02:00.014]
    ^-- H/N/C [hosts=2, nodes=2, CPUs=64]
    ^-- CPU [cur=0.2%, avg=0.12%, GC=0%]
    ^-- PageMemory [pages=0]
    ^-- Heap [used=112MB, free=88.57%, comm=981MB]
    ^-- Non heap [used=50MB, free=96.67%, comm=51MB]
    ^-- Outbound messages queue [size=0]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=5, qSize=0]
[12:00:13,117][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/192.168.28.163, rmtPort=41574]
[12:00:13,117][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/192.168.28.163, rmtPort=41574]
[12:00:13,117][INFO][tcp-disco-sock-reader-#9][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/192.168.28.163:41574, rmtPort=41574]
[12:00:13,122][INFO][tcp-disco-sock-reader-#9][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/192.168.28.163:41574, rmtPort=41574
[12:00:19,339][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/192.168.28.163, rmtPort=60878]
[12:00:19,340][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/192.168.28.163, rmtPort=60878]
[12:00:19,340][INFO][tcp-disco-sock-reader-#10][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/192.168.28.163:60878, rmtPort=60878]
[12:00:32,596][INFO][grid-timeout-worker-#71][IgniteKernal] 
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=4769f8fa, uptime=00:03:00.020]
    ^-- H/N/C [hosts=2, nodes=2, CPUs=64]
    ^-- CPU [cur=0.03%, avg=0.1%, GC=0%]
    ^-- PageMemory [pages=0]
    ^-- Heap [used=119MB, free=87.82%, comm=981MB]
    ^-- Non heap [used=50MB, free=96.65%, comm=52MB]
    ^-- Outbound messages queue [size=0]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=6, qSize=0]
[12:00:34,361][INFO][tcp-disco-sock-reader-#10][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/192.168.28.163:60878, rmtPort=60878
[12:00:34,434][INFO][tcp-disco-sock-reader-#8][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/192.168.28.162:48386, rmtPort=48386
[12:00:39,572][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/192.168.28.163, rmtPort=50348]
[12:00:39,573][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/192.168.28.163, rmtPort=50348]
[12:00:39,573][INFO][tcp-disco-sock-reader-#11][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/192.168.28.163:50348, rmtPort=50348]
[12:00:41,880][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/192.168.28.163, rmtPort=44933]
[12:00:41,880][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/192.168.28.163, rmtPort=44933]
[12:00:41,881][INFO][tcp-disco-sock-reader-#12][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/192.168.28.163:44933, rmtPort=44933]
[12:00:41,885][INFO][tcp-disco-sock-reader-#12][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/192.168.28.163:44933, rmtPort=44933
[12:00:44,448][WARNING][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Timed out waiting for message delivery receipt (most probably, the reason is in long GC pauses on remote node; consider tuning GC and increasing 'ackTimeout' configuration property). Will retry to send message with increased timeout [currentTimeout=10000, rmtAddr=/192.168.28.162:47500, rmtPort=47500]
[12:00:44,451][WARNING][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Failed to send message to next node [msg=TcpDiscoveryStatusCheckMessage [creatorNode=TcpDiscoveryNode [id=8c87d53c-ba5e-4bdc-800c-0a51f391fc38, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.28.162], sockAddrs=[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.28.162:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1525190412503, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false], failedNodeId=null, status=1, super=TcpDiscoveryAbstractMessage [sndNodeId=8c87d53c-ba5e-4bdc-800c-0a51f391fc38, id=a9d4d6c1361-8c87d53c-ba5e-4bdc-800c-0a51f391fc38, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=TcpDiscoveryNode [id=8c87d53c-ba5e-4bdc-800c-0a51f391fc38, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.28.162], sockAddrs=[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.28.162:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1525190343144, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false], errMsg=Failed to send message to next node [msg=TcpDiscoveryStatusCheckMessage [creatorNode=TcpDiscoveryNode [id=8c87d53c-ba5e-4bdc-800c-0a51f391fc38, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.28.162], sockAddrs=[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.28.162:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1525190412503, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false], failedNodeId=null, status=1, super=TcpDiscoveryAbstractMessage [sndNodeId=8c87d53c-ba5e-4bdc-800c-0a51f391fc38, id=a9d4d6c1361-8c87d53c-ba5e-4bdc-800c-0a51f391fc38, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=ClusterNode [id=8c87d53c-ba5e-4bdc-800c-0a51f391fc38, order=2, addr=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.28.162], daemon=false]]]
[12:00:44,464][WARNING][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Local node has detected failed nodes and started cluster-wide procedure. To speed up failure detection please see 'Failure Detection' section under javadoc for 'TcpDiscoverySpi'
[12:00:44,468][INFO][disco-event-worker-#101][GridDiscoveryManager] Added new node to topology: TcpDiscoveryNode [id=c096c28e-c1da-4f39-8c5d-db30e01826a7, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.28.163], sockAddrs=[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.28.163:47500], discPort=47500, order=3, intOrder=3, lastExchangeTime=1525190406877, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false]
[12:00:44,469][INFO][disco-event-worker-#101][GridDiscoveryManager] Topology snapshot [ver=3, servers=3, clients=0, CPUs=96, offheap=75.0GB, heap=3.0GB]
[12:00:44,469][INFO][disco-event-worker-#101][GridDiscoveryManager] Data Regions Configured:
[12:00:44,469][INFO][disco-event-worker-#101][GridDiscoveryManager]   ^-- default [initSize=256.0 MiB, maxSize=25.1 GiB, persistenceEnabled=true]
[12:00:44,474][WARNING][disco-event-worker-#101][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=8c87d53c-ba5e-4bdc-800c-0a51f391fc38, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.28.162], sockAddrs=[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.28.162:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1525190343144, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false]
[12:00:44,475][INFO][disco-event-worker-#101][GridDiscoveryManager] Topology snapshot [ver=4, servers=2, clients=0, CPUs=64, offheap=50.0GB, heap=2.0GB]
[12:00:44,475][INFO][disco-event-worker-#101][GridDiscoveryManager] Data Regions Configured:
[12:00:44,475][INFO][disco-event-worker-#101][GridDiscoveryManager]   ^-- default [initSize=256.0 MiB, maxSize=25.1 GiB, persistenceEnabled=true]
[12:00:48,104][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/192.168.28.163, rmtPort=42252]
[12:00:48,105][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/192.168.28.163, rmtPort=42252]
[12:00:48,105][INFO][tcp-disco-sock-reader-#13][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/192.168.28.163:42252, rmtPort=42252]
[12:00:48,124][INFO][tcp-disco-sock-reader-#13][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/192.168.28.163:42252, rmtPort=42252
[12:00:54,338][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/192.168.28.163, rmtPort=51196]
[12:00:54,339][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/192.168.28.163, rmtPort=51196]
[12:00:54,339][INFO][tcp-disco-sock-reader-#14][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/192.168.28.163:51196, rmtPort=51196]
[12:00:54,342][INFO][tcp-disco-sock-reader-#14][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/192.168.28.163:51196, rmtPort=51196
[12:00:59,482][INFO][tcp-disco-sock-reader-#11][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/192.168.28.163:50348, rmtPort=50348
[12:01:00,568][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/192.168.28.163, rmtPort=41629]
[12:01:00,568][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/192.168.28.163, rmtPort=41629]
[12:01:00,569][INFO][tcp-disco-sock-reader-#15][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/192.168.28.163:41629, rmtPort=41629]
[12:01:00,571][INFO][tcp-disco-sock-reader-#15][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/192.168.28.163:41629, rmtPort=41629
[12:01:00,610][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/192.168.28.163, rmtPort=49138]
[12:01:00,611][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/192.168.28.163, rmtPort=49138]
[12:01:00,611][INFO][tcp-disco-sock-reader-#16][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/192.168.28.163:49138, rmtPort=49138]
[12:01:00,637][WARNING][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Node is out of topology (probably, due to short-time network problems).
[12:01:00,637][INFO][tcp-disco-sock-reader-#16][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/192.168.28.163:49138, rmtPort=49138
[12:01:00,638][WARNING][disco-event-worker-#101][GridDiscoveryManager] Local node SEGMENTED: TcpDiscoveryNode [id=4769f8fa-e388-4208-a61c-6a7a44a70d74, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.28.161], sockAddrs=[Redis1/192.168.28.161:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1525190460629, loc=true, ver=2.4.0#20180305-sha1:aa342270, isClient=false]
[12:01:00,640][WARNING][disco-event-worker-#101][GridDiscoveryManager] Stopping local node according to configured segmentation policy.
[12:01:00,641][WARNING][disco-event-worker-#101][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=c096c28e-c1da-4f39-8c5d-db30e01826a7, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.28.163], sockAddrs=[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.28.163:47500], discPort=47500, order=3, intOrder=3, lastExchangeTime=1525190406877, loc=false, ver=2.4.0#20180305-sha1:aa342270, isClient=false]
[12:01:00,642][INFO][disco-event-worker-#101][GridDiscoveryManager] Topology snapshot [ver=5, servers=1, clients=0, CPUs=32, offheap=25.0GB, heap=1.0GB]

Solution

  • The default multicast group is 228.1.2.4, so the example configuration uses that group when multicastGroup is not set.

    Have you tried using org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder instead of multicast? If for some reason multicast doesn't work properly in your environment, discovery with static IP addresses will still work. Here is an example with the static IP finder:

    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
        <property name="addresses">
            <list>
                <!-- In distributed environment, replace with actual host IP address. -->
                <value>127.0.0.1:47500..47509</value>
            </list>
        </property>
    </bean>
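
    For the cluster in the question, a minimal sketch of the static approach could list the three server hosts that appear in the logs (192.168.28.161, 192.168.28.162 and 192.168.28.163) and keep the 47500..47509 port range from the example. This assumes those are the only server nodes and that they all use the default discovery port 47500, as the logs suggest; adjust the addresses if your hosts differ. The fragment belongs inside the IgniteConfiguration bean.

```xml
<property name="discoverySpi">
  <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
    <property name="ipFinder">
      <!-- Static IP finder: no multicast traffic is needed for discovery. -->
      <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
        <property name="addresses">
          <list>
            <!-- Assumption: server hosts taken from the logs in the question. -->
            <value>192.168.28.161:47500..47509</value>
            <value>192.168.28.162:47500..47509</value>
            <value>192.168.28.163:47500..47509</value>
          </list>
        </property>
      </bean>
    </property>
  </bean>
</property>
```

    Use the same address list on every node, clients included. If the nodes still drop out of the topology with this configuration, multicast is probably not the cause of the failures.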