Tags: postgresql, pacemaker, corosync

Pacemaker not able to start slave node on postgres-11


I have two nodes (node03 and node04) in a master-slave, hot-standby setup, with Pacemaker managing the cluster. Before a switchover, node04 was the master and node03 was the standby. Since the switchover I have been trying to bring node04 back as the slave node, but I am not able to do it.

During the switchover I realized that someone had changed the config file and set the ignore_system_indexes parameter to true. I had to remove it and restart the postgres server manually. It was after this that the cluster started behaving oddly.
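
For reference, this is roughly what that manual fix looked like; I am assuming the parameter had been added to postgresql.conf under the pgdata path configured in the CIB (/DPxxxx01/datadg/data), so adjust the paths to your layout:

grep -n 'ignore_system_indexes' /DPxxxx01/datadg/data/postgresql.conf
# after removing or commenting out the line, restart PostgreSQL manually
sudo -u postgres /usr/pgsql-11/bin/pg_ctl -D /DPxxxx01/datadg/data restart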

node04 can be brought back up as a slave node manually, i.e., if I start the PostgreSQL instance by hand and use the recovery.conf file below.
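
For what it's worth, here is a minimal sketch of that manual start, assuming the pgctl and pgdata values from the CIB below and that recovery.conf is already in place under the data directory:

sudo -u postgres /usr/pgsql-11/bin/pg_ctl -D /DPxxxx01/datadg/data start
# confirm the instance came up as a standby
sudo -u postgres /usr/pgsql-11/bin/psql -p 5432 -c "SELECT pg_is_in_recovery();"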

Here are the outputs and configuration files needed to understand the situation:

sudo crm_mon -A1f
Stack: corosync
Current DC: node03 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum

Node node04: standby
Online: [ node03 ]

Active resources:

 Resource Group: master-group
     vip-repli  (ocf::heartbeat:IPaddr2):       Started node03
     vip-master (ocf::heartbeat:IPaddr2):       Started node03
 Master/Slave Set: pgsql-cluster [pgsqlins]
     Masters: [ node03 ]

Node Attributes:
* Node node03:
    + master-pgsqlins                   : 1000
    + pgsqlins-data-status              : LATEST
    + pgsqlins-master-baseline          : 00008820DC000098
    + pgsqlins-status                   : PRI
* Node node04:
    + master-pgsqlins                   : -INFINITY
    + pgsqlins-data-status              : DISCONNECT
    + pgsqlins-status                   : STOP

Migration Summary:
* Node node03:
* Node node04:

recovery.conf

primary_conninfo = 'host=1xx.xx.xx.xx port=5432 user=replica application_name=node04 keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
restore_command = 'rsync -a /Dxxxxx1/wal_archive/%f %p'
recovery_target_timeline = 'latest'
standby_mode = 'on'

cluster cib

sudo pcs cluster cib
<cib crm_feature_set="3.0.14" validate-with="pacemaker-2.10" epoch="269" num_updates="4" admin_epoch="0" cib-last-written="Mon Jun 28 15:13:35 2021" update-origin="node04" update-client="crmd" update-user="hacluster" have-quorum="1" dc-uuid="1">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
        <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
        <nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog" value="false"/>
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.23-1.el7_9.1-9acf116022"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
        <nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name" value="pgcluster"/>
        <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1624860815"/>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="1" uname="node03">
        <instance_attributes id="nodes-1">
          <nvpair id="nodes-1-pgsqlins-data-status" name="pgsqlins-data-status" value="LATEST"/>
        </instance_attributes>
      </node>
      <node id="2" uname="node04">
        <instance_attributes id="nodes-2">
          <nvpair id="nodes-2-pgsqlins-data-status" name="pgsqlins-data-status" value="DISCONNECT"/>
          <nvpair id="nodes-2-standby" name="standby" value="on"/>
        </instance_attributes>
      </node>
    </nodes>
    <resources>
      <group id="master-group">
        <primitive class="ocf" id="vip-repli" provider="heartbeat" type="IPaddr2">
          <instance_attributes id="vip-repli-instance_attributes">
            <nvpair id="vip-repli-instance_attributes-cidr_netmask" name="cidr_netmask" value="24"/>
            <nvpair id="vip-repli-instance_attributes-ip" name="ip" value="1xx.xx.xx.xx"/>
            <nvpair id="vip-repli-instance_attributes-nic" name="nic" value="eth2"/>
          </instance_attributes>
          <operations>
            <op id="vip-repli-monitor-interval-10s" interval="10s" name="monitor" timeout="20s"/>
            <op id="vip-repli-start-interval-0s" interval="0s" name="start" timeout="20s"/>
            <op id="vip-repli-stop-interval-0s" interval="0s" name="stop" timeout="20s"/>
          </operations>
        </primitive>
        <primitive class="ocf" id="vip-master" provider="heartbeat" type="IPaddr2">
          <instance_attributes id="vip-master-instance_attributes">
            <nvpair id="vip-master-instance_attributes-cidr_netmask" name="cidr_netmask" value="24"/>
            <nvpair id="vip-master-instance_attributes-ip" name="ip" value="1x.xx.xxx.xxx"/>
            <nvpair id="vip-master-instance_attributes-nic" name="nic" value="eth1"/>
          </instance_attributes>
          <operations>
            <op id="vip-master-monitor-interval-10s" interval="10s" name="monitor" timeout="20s"/>
            <op id="vip-master-start-interval-0s" interval="0s" name="start" timeout="20s"/>
            <op id="vip-master-stop-interval-0s" interval="0s" name="stop" timeout="20s"/>
          </operations>
        </primitive>
      </group>
      <master id="pgsql-cluster">
        <primitive class="ocf" id="pgsqlins" provider="heartbeat" type="pgsql11">
          <instance_attributes id="pgsqlins-instance_attributes">
            <nvpair id="pgsqlins-instance_attributes-master_ip" name="master_ip" value="1xx.xx.xx.xx"/>
            <nvpair id="pgsqlins-instance_attributes-node_list" name="node_list" value="node03 node04"/>
            <nvpair id="pgsqlins-instance_attributes-pgctl" name="pgctl" value="/usr/pgsql-11/bin/pg_ctl"/>
            <nvpair id="pgsqlins-instance_attributes-pgdata" name="pgdata" value="/DPxxxx01/datadg/data"/>
            <nvpair id="pgsqlins-instance_attributes-pgport" name="pgport" value="5432"/>
            <nvpair id="pgsqlins-instance_attributes-primary_conninfo_opt" name="primary_conninfo_opt" value="keepalives_idle=60 keepalives_interval=5 keepalives_count=5"/>
            <nvpair id="pgsqlins-instance_attributes-psql" name="psql" value="/usr/pgsql-11/bin/psql"/>
            <nvpair id="pgsqlins-instance_attributes-rep_mode" name="rep_mode" value="sync"/>
            <nvpair id="pgsqlins-instance_attributes-repuser" name="repuser" value="replica"/>
            <nvpair id="pgsqlins-instance_attributes-restart_on_promote" name="restart_on_promote" value="true"/>
            <nvpair id="pgsqlins-instance_attributes-restore_command" name="restore_command" value="rsync -a /Dxxxxx01/wal_archive/%f %p"/>
          </instance_attributes>
          <operations>
            <op id="pgsqlins-demote-interval-0" interval="0" name="demote" on-fail="stop" timeout="60s"/>
            <op id="pgsqlins-methods-interval-0s" interval="0s" name="methods" timeout="5s"/>
            <op id="pgsqlins-monitor-interval-10s" interval="10s" name="monitor" on-fail="restart" timeout="60s"/>
            <op id="pgsqlins-monitor-interval-9s" interval="9s" name="monitor" on-fail="restart" role="Master" timeout="60s"/>
            <op id="pgsqlins-notify-interval-0" interval="0" name="notify" timeout="60s"/>
            <op id="pgsqlins-promote-interval-0" interval="0" name="promote" on-fail="restart" timeout="60s"/>
            <op id="pgsqlins-start-interval-0" interval="0" name="start" on-fail="restart" timeout="60s"/>
            <op id="pgsqlins-stop-interval-0" interval="0" name="stop" on-fail="block" timeout="60s"/>
          </operations>
        </primitive>
        <meta_attributes id="pgsql-cluster-meta_attributes">
          <nvpair id="pgsql-cluster-meta_attributes-master-node-max" name="master-node-max" value="1"/>
          <nvpair id="pgsql-cluster-meta_attributes-clone-max" name="clone-max" value="2"/>
          <nvpair id="pgsql-cluster-meta_attributes-notify" name="notify" value="true"/>
          <nvpair id="pgsql-cluster-meta_attributes-master-max" name="master-max" value="1"/>
          <nvpair id="pgsql-cluster-meta_attributes-clone-node-max" name="clone-node-max" value="1"/>
        </meta_attributes>
      </master>
    </resources>
    <constraints>
      <rsc_colocation id="colocation-master-group-pgsql-cluster-INFINITY" rsc="master-group" score="INFINITY" with-rsc="pgsql-cluster" with-rsc-role="Master"/>
      <rsc_order first="pgsql-cluster" first-action="promote" id="order-pgsql-cluster-master-group-INFINITY" score="INFINITY" symmetrical="false" then="master-group" then-action="start"/>
      <rsc_order first="pgsql-cluster" first-action="demote" id="order-pgsql-cluster-master-group-0" score="0" symmetrical="false" then="master-group" then-action="stop"/>
      <rsc_location id="cli-prefer-pgsql-cluster" rsc="pgsql-cluster" role="Started" node="node04" score="INFINITY"/>
    </constraints>
  </configuration>
  <status>
    <node_state id="1" uname="node03" in_ccm="true" crmd="online" crm-debug-origin="do_update_resource" join="member" expected="member">
      <transient_attributes id="1">
        <instance_attributes id="status-1">
          <nvpair id="status-1-pgsqlins-status" name="pgsqlins-status" value="PRI"/>
          <nvpair id="status-1-master-pgsqlins" name="master-pgsqlins" value="1000"/>
          <nvpair id="status-1-pgsqlins-master-baseline" name="pgsqlins-master-baseline" value="00008820DC000098"/>
        </instance_attributes>
      </transient_attributes>
      <lrm id="1">
        <lrm_resources>
          <lrm_resource id="vip-master" type="IPaddr2" class="ocf" provider="heartbeat">
            <lrm_rsc_op id="vip-master_last_0" operation_key="vip-master_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="3:433:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" transition-magic="0:0;3:433:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" exit-reason="" on_node="node03" call-id="535" rc-code="0" op-status="0" interval="0" last-run="1624859077" last-rc-change="1624859077" exec-time="90" queue-time="0" op-digest="38fc1b2633211138e53cb349a5c147ff"/>
            <lrm_rsc_op id="vip-master_monitor_10000" operation_key="vip-master_monitor_10000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="4:433:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" transition-magic="0:0;4:433:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" exit-reason="" on_node="node03" call-id="536" rc-code="0" op-status="0" interval="10000" last-rc-change="1624859077" exec-time="72" queue-time="0" op-digest="4cbf56ab9e52c6f07a7be8cbb786451c"/>
          </lrm_resource>
          <lrm_resource id="vip-repli" type="IPaddr2" class="ocf" provider="heartbeat">
            <lrm_rsc_op id="vip-repli_last_0" operation_key="vip-repli_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="1:433:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" transition-magic="0:0;1:433:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" exit-reason="" on_node="node03" call-id="532" rc-code="0" op-status="0" interval="0" last-run="1624859077" last-rc-change="1624859077" exec-time="127" queue-time="0" op-digest="dd04ed3322c75b7bab13c5bea56dbe77"/>
            <lrm_rsc_op id="vip-repli_monitor_10000" operation_key="vip-repli_monitor_10000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="2:433:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" transition-magic="0:0;2:433:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" exit-reason="" on_node="node03" call-id="534" rc-code="0" op-status="0" interval="10000" last-rc-change="1624859077" exec-time="55" queue-time="0" op-digest="c76770c29a91fb082fdf1fdd8b0469c3"/>
          </lrm_resource>
          <lrm_resource id="pgsqlins" type="pgsql11" class="ocf" provider="heartbeat">
            <lrm_rsc_op id="pgsqlins_last_0" operation_key="pgsqlins_promote_0" operation="promote" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="12:432:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" transition-magic="0:0;12:432:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" exit-reason="" on_node="node03" call-id="530" rc-code="0" op-status="0" interval="0" last-run="1624859073" last-rc-change="1624859073" exec-time="3307" queue-time="0" op-digest="2f51441ed087061eb68745fd8157ddb6"/>
            <lrm_rsc_op id="pgsqlins_monitor_9000" operation_key="pgsqlins_monitor_9000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="13:433:8:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" transition-magic="0:8;13:433:8:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" exit-reason="" on_node="node03" call-id="533" rc-code="8" op-status="0" interval="9000" last-rc-change="1624859078" exec-time="497" queue-time="1" op-digest="978aa48a7da35944c793e174dbee9a1d"/>
          </lrm_resource>
        </lrm_resources>
      </lrm>
    </node_state>
    <node_state id="2" uname="node04" in_ccm="true" crmd="online" crm-debug-origin="do_update_resource" join="member" expected="member">
      <lrm id="2">
        <lrm_resources>
          <lrm_resource id="vip-repli" type="IPaddr2" class="ocf" provider="heartbeat">
            <lrm_rsc_op id="vip-repli_last_0" operation_key="vip-repli_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="4:1:7:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" transition-magic="0:7;4:1:7:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" exit-reason="" on_node="node04" call-id="5" rc-code="7" op-status="0" interval="0" last-run="1624600624" last-rc-change="1624600624" exec-time="65" queue-time="0" op-digest="dd04ed3322c75b7bab13c5bea56dbe77"/>
          </lrm_resource>
          <lrm_resource id="vip-master" type="IPaddr2" class="ocf" provider="heartbeat">
            <lrm_rsc_op id="vip-master_last_0" operation_key="vip-master_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="5:1:7:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" transition-magic="0:7;5:1:7:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" exit-reason="" on_node="node04" call-id="9" rc-code="7" op-status="0" interval="0" last-run="1624600624" last-rc-change="1624600624" exec-time="62" queue-time="0" op-digest="38fc1b2633211138e53cb349a5c147ff"/>
          </lrm_resource>
          <lrm_resource id="pgsqlins" type="pgsql11" class="ocf" provider="heartbeat">
            <lrm_rsc_op id="pgsqlins_last_0" operation_key="pgsqlins_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.14" transition-key="4:436:7:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" transition-magic="0:7;4:436:7:54755ae3-42a4-477c-ae37-8ae8bfbc1f04" exit-reason="" on_node="node04" call-id="192" rc-code="7" op-status="0" interval="0" last-run="1624860816" last-rc-change="1624860816" exec-time="178" queue-time="0" op-digest="2f51441ed087061eb68745fd8157ddb6"/>
          </lrm_resource>
        </lrm_resources>
      </lrm>
      <transient_attributes id="2">
        <instance_attributes id="status-2">
          <nvpair id="status-2-pgsqlins-status" name="pgsqlins-status" value="STOP"/>
          <nvpair id="status-2-master-pgsqlins" name="master-pgsqlins" value="-INFINITY"/>
        </instance_attributes>
      </transient_attributes>
    </node_state>
  </status>
</cib>

If I take node04 out of standby, the cluster first demotes node03 and then tries to bring node04 up, but node04 never comes up. I also tried bringing up only node04 on its own, and that fails as well. However, if I bring node04 up manually from that state, I am able to do it. It also fails if I try to clean up the pgsqlins resource.
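
For clarity, these are roughly the commands involved (pcs 0.9 syntax as shipped on EL7; newer pcs versions use pcs node unstandby instead):

sudo pcs cluster unstandby node04    # this is the point where node03 gets demoted
sudo pcs resource cleanup pgsqlins   # the cleanup attempt that also fails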

Here is an excerpt from corosync.log:

Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_process_request:  Forwarding cib_apply_diff operation for section 'all' to all (origin=local/cibadmin/2)
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: --- 0.251.32 2
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: +++ 0.252.0 b956759712580c1bfdffd25cbf4ab8e9
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       -- /cib/configuration/nodes/node[@id='2']/instance_attributes[@id='nodes-2']/nvpair[@id='nodes-2-standby']
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       +  /cib:  @epoch=252, @num_updates=0
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_process_request:  Completed cib_apply_diff operation for section 'all': OK (rc=0, origin=dci2pgs04/cibadmin/2, version=0.252.0)
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_file_backup:      Archived previous version as /var/lib/pacemaker/cib/cib-60.raw
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_file_write_with_digest:   Wrote version 0.252.0 of the CIB to disk (digest: 8b99629d323c923de592700bc4398c49)
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_file_write_with_digest:   Reading cluster configuration file /var/lib/pacemaker/cib/cib.ZtvQXP (digest: /var/lib/pacemaker/cib/cib.fh4Toy)
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: --- 0.252.0 2
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: +++ 0.252.1 (null)
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       +  /cib:  @num_updates=1
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       +  /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='pgsqlins']/lrm_rsc_op[@id='pgsqlins_last_0']:  @operation_key=pgsqlins_demote_0, @operation=demote, @transition-key=10:396:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04, @transition-magic=-1:193;10:396:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04, @call-id=-1, @rc-code=193, @op-status=-1, @last-run=1624852894, @last-rc-change=1624852894, @exec-time=0
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_process_request:  Completed cib_modify operation for section status: OK (rc=0, origin=node03/crmd/948, version=0.252.1)
Jun 28 13:01:34 [9294] node04.dc.japannext.co.jp      attrd:     info: attrd_peer_update:    Setting master-pgsqlins[node03]: 1000 -> -INFINITY from node03
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: --- 0.252.1 2
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: +++ 0.252.2 (null)
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       +  /cib:  @num_updates=2
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       +  /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-master-pgsqlins']:  @value=-INFINITY
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_process_request:  Completed cib_modify operation for section status: OK (rc=0, origin=node03/attrd/211, version=0.252.2)
Jun 28 13:01:34 [9294] node04.dc.japannext.co.jp      attrd:     info: attrd_peer_update:    Setting pgsqlins-master-baseline[node03]: 00008820CC000098 -> (null) from node03
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: --- 0.252.2 2
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: +++ 0.252.3 (null)
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       -- /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-pgsqlins-master-baseline']
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       +  /cib:  @num_updates=3
Jun 28 13:01:34 [9291] node04.dc.japannext.co.jp        cib:     info: cib_process_request:  Completed cib_modify operation for section status: OK (rc=0, origin=node03/attrd/212, version=0.252.3)
Jun 28 13:01:35 [9294] node04.dc.japannext.co.jp      attrd:     info: attrd_peer_update:    Setting pgsqlins-status[node03]: PRI -> STOP from node03
.
.
.
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       +  /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='pgsqlins']/lrm_rsc_op[@id='pgsqlins_last_0']:  @transition-magic=0:0;9:397:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04, @call-id=445, @rc-code=0, @op-status=0, @exec-time=471
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_process_request:  Completed cib_modify operation for section status: OK (rc=0, origin=node03/crmd/956, version=0.252.11)
Jun 28 13:01:36 [9296] node04.dc.japannext.co.jp       crmd:     info: do_lrm_rsc_op:        Performing key=10:397:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04 op=pgsqlins_start_0
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_process_request:  Forwarding cib_modify operation for section status to all (origin=local/crmd/142)
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: --- 0.252.11 2
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: +++ 0.252.12 (null)
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       +  /cib:  @num_updates=12
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       +  /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='pgsqlins']/lrm_rsc_op[@id='pgsqlins_last_0']:  @operation_key=pgsqlins_start_0, @operation=start, @transition-key=12:397:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04, @transition-magic=-1:193;12:397:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04, @call-id=-1, @rc-code=193, @op-status=-1, @exec-time=0
Jun 28 13:01:36 [9293] node04.dc.japannext.co.jp       lrmd:     info: log_execute:  executing - rsc:pgsqlins action:start call_id:132
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_process_request:  Completed cib_modify operation for section status: OK (rc=0, origin=node03/crmd/957, version=0.252.12)
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: --- 0.252.12 2
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       Diff: +++ 0.252.13 (null)
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       +  /cib:  @num_updates=13
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_perform_op:       +  /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='pgsqlins']/lrm_rsc_op[@id='pgsqlins_last_0']:  @operation_key=pgsqlins_start_0, @operation=start, @transition-key=10:397:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04, @transition-magic=-1:193;10:397:0:54755ae3-42a4-477c-ae37-8ae8bfbc1f04, @call-id=-1, @rc-code=193, @op-status=-1, @last-run=1624852896, @last-rc-change=1624852896, @exec-time=0
Jun 28 13:01:36 [9291] node04.dc.japannext.co.jp        cib:     info: cib_process_request:  Completed cib_modify operation for section status: OK (rc=0, origin=node04/crmd/142, version=0.252.13)
Jun 28 13:01:37  pgsql11(pgsqlins)[9613]:    INFO: Set all nodes into async mode.
Jun 28 13:01:37  pgsql11(pgsqlins)[9613]:    INFO: PostgreSQL is down
Jun 28 13:01:37  pgsql11(pgsqlins)[9613]:    INFO: server starting
Jun 28 13:01:37  pgsql11(pgsqlins)[9613]:    INFO: PostgreSQL start command sent.
Jun 28 13:01:37  pgsql11(pgsqlins)[9613]:    WARNING: Can't get PostgreSQL recovery status. rc=2

My guess is that Pacemaker is reading the pre-switchover state from /var/lib/pacemaker/cib and is using that to decide on these steps. Any help on how to reset it would be appreciated.
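
If stale state really is the problem, one way I could think of to test that, sketched under the assumption that node04 can be taken fully offline first, would be:

sudo pcs cluster stop node04
# on node04: drop the local CIB copy; it should be resynced from the DC (node03) on rejoin
sudo rm -f /var/lib/pacemaker/cib/*
# from node03: clear the stored replication data-status attribute for node04
sudo crm_attribute -N node04 -n pgsqlins-data-status -D -l forever
sudo pcs cluster start node04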


Solution

    • As mentioned in the question, when node04 was taken out of standby, Pacemaker would demote node03 and try to make node04 the master. It would fail at this and then make node03 the standalone master again.

    • Since I suspected it was picking up some old configuration from the cib or pengine folders, I even destroyed the cluster on both nodes, removed pacemaker, pcs, and corosync, and reinstalled all of them.

    • Even after all of that, the problem persisted. I then suspected that the permissions on the /var/lib/pgsql/ folder on node04 might not be right and started exploring it.

    • Only then did I notice an old PGSQL.lock.bak file dated June 11, i.e., older than the current PGSQL.lock file on node03; because of it, Pacemaker kept trying to promote node04 and failing. Pacemaker does not report this as an error in any log, and there is no hint of it in the crm_mon output either. Once I removed this file, it worked like a charm.

    TL;DR:

    • Check whether there are any PGSQL.lock.bak or other leftover files in the /var/lib/pgsql/tmp folder and remove them before starting Pacemaker again (a sketch follows below).
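
    A minimal sketch of that check, assuming the pgsql resource agent's default tmpdir of /var/lib/pgsql/tmp (adjust if tmpdir was overridden in the resource configuration):

    ls -l /var/lib/pgsql/tmp/
    # remove the stale backup lock file (dated before the current PGSQL.lock on the master)
    sudo rm -f /var/lib/pgsql/tmp/PGSQL.lock.bak
    # then clear the resource history and bring the node back into the cluster
    sudo pcs resource cleanup pgsqlins
    sudo pcs cluster unstandby node04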