Tags: cluster-computing, high-availability, failover, pacemaker, corosync

pcs does not stop failed-over resources on the partner node before starting them on the main node when both machines boot at the same time


I have only recently started working with clusters, so let me know if you need any more information.

I have an active-active HA cluster that is designed to keep working in a failover scenario.

Node1 and Node2 form the active-active cluster, with Pacemaker and Corosync as the cluster manager. Each node has one resource group containing three resources.

When Node1 goes down, Node2 takes over its resources as expected. When Node1 comes back online, pcs first stops Node1's resources on Node2 and then starts them on Node1, which is also expected and works fine.

Issue: I am facing a problem when both nodes are booted at the same time.

Scenario: both nodes are powered off and then powered on at the same time. Say Node2 finishes booting first. pcs sees that Node1 is still offline (still booting) and starts Node1's resources on Node2. It then also starts Node2's own resources on Node2.

When Node1 has finished booting, it starts its own resources. The problem is that the Node1 resources currently running on Node2 (the failed-over copies) are not stopped before Node1 starts them.

In the end, Node1 has its own resources running, and Node2 has both Node1's and Node2's resources running.

This never happens when the nodes are booted some time apart (for example, 15 minutes). It also works fine when only one node is rebooted or powered off.
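The end state described above (the same resource active on both nodes) can be checked for with a small script. This is only a rough sketch: the resource ID `node1_ip` is a made-up placeholder, and the "is running on:" wording is an assumption about what `crm_resource --locate` prints on this Pacemaker version, so adjust both to match your cluster.

```shell
#!/bin/sh
# Count how many nodes report a resource as running.
# Reads `crm_resource --locate` output on stdin. The "is running on:" text
# is an ASSUMPTION about the output format; verify it on your version.
count_running_nodes() {
    # grep -c prints the match count but exits non-zero on zero matches,
    # so "|| true" keeps the function from reporting failure in that case.
    grep -c 'is running on:' || true
}

# Hypothetical primitive resource ID; replace with a real one from `pcs status`.
RES="node1_ip"

# Only query the cluster when the Pacemaker CLI is actually installed.
if command -v crm_resource >/dev/null 2>&1; then
    n=$(crm_resource --resource "$RES" --locate 2>/dev/null | count_running_nodes)
    if [ "$n" -gt 1 ]; then
        echo "WARNING: $RES is active on $n nodes"
    else
        echo "$RES is active on $n node(s)"
    fi
fi
```

Running this on either node right after a simultaneous boot should show whether a resource has ended up started on both nodes.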

            # pcs property list --all
            Cluster Properties:
            batch-limit: 0
            cluster-delay: 60s
            cluster-infrastructure: cman
            cluster-recheck-interval: 15min
            crmd-finalization-timeout: 30min
            crmd-integration-timeout: 3min
            crmd-transition-delay: 0s
            dc-deadtime: 20s
            dc-version: 1.1.11-97629de
            default-action-timeout: 20s
            default-resource-stickiness: 0
            election-timeout: 2min
            enable-startup-probes: true
            expected-quorum-votes: 2
            is-managed-default: true
            last-lrm-refresh: 1565098302
            load-threshold: 80%
            maintenance-mode: false
            migration-limit: -1
            no-quorum-policy: ignore
            node-action-limit: 0
            node-health-green: 0
            node-health-red: -INFINITY
            node-health-strategy: none
            node-health-yellow: 0
            pe-error-series-max: -1
            pe-input-series-max: 4000
            pe-warn-series-max: 5000
            placement-strategy: default
            remove-after-stop: false
            shutdown-escalation: 20min
            start-failure-is-fatal: true
            startup-fencing: true
            stonith-action: reboot
            stonith-enabled: false
            stonith-timeout: 60s
            stop-all-resources: false
            stop-orphan-actions: true
            stop-orphan-resources: true
            symmetric-cluster: false

Solution

  • I was able to fix this issue by upgrading to pcs version 0.9.155. The older pcs version I was running had this bug when both nodes were rebooted simultaneously.
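To see whether a node is already on the fixed version, something like the following might help. It is a minimal sketch that assumes GNU `sort -V` is available for version comparison; `version_ge` is a helper defined here, not part of pcs.

```shell
#!/bin/sh
# Succeed if version $1 is greater than or equal to version $2.
# Relies on GNU coreutils `sort -V` (version sort); this is an assumption
# about the environment.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

REQUIRED="0.9.155"   # the version reported above as fixing the issue

# Only run the check when pcs is installed on this machine.
if command -v pcs >/dev/null 2>&1; then
    current=$(pcs --version)
    if version_ge "$current" "$REQUIRED"; then
        echo "pcs $current is at or above $REQUIRED"
    else
        echo "pcs $current is older than $REQUIRED; consider upgrading"
    fi
fi
```

Run it on both nodes, since each node has its own pcs installation.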