Tags: linux, cluster-computing, heartbeat, pacemaker

How to configure the "Clearing expired failcount" time with Pacemaker


I have a problem with the failcount when using Pacemaker and Corosync.

My /var/log/messages file:

Dec 23 22:19:36 node1 attrd[1922]: notice: attrd_perform_update: Sent update 81: fail-count-named=1

My latest failcount update was at Dec 23 22:19:36.

But a few minutes later:

Dec 23 22:34:47 node1 pengine[1923]: notice: unpack_rsc_op: Clearing expired failcount for named:0 on node1 
Dec 23 22:34:47 node1 pengine[1923]: notice: unpack_rsc_op: Re-initiated expired calculated failure named_last_failure_0 (rc=7, magic=0:7;21:32:0:f1d80836-3649-45c5-abd5-8c7d4ef5d7f9) on node1 

The failcount has been cleared. It took about 15 minutes.

My cib.xml:

<nvpair id="rs-resource-stickiness" name="resource-stickiness" value="300"/>
<nvpair id="rs_defaults_migration-threshold" name="migration-threshold" value="3"/>
<nvpair id="rs_defaults_failure-timeout" name="failure-timeout" value="60s"/>

I don't know where the failcount expiry time is stored, or how I can configure or remove it.


Solution

  • Combine cluster-recheck-interval and failure-timeout to control when failcounts expire automatically. The expiry time is not stored anywhere: on each policy-engine run, Pacemaker compares the time of the last failure against failure-timeout and clears the failcount if it has passed. Since the policy engine is only guaranteed to run every cluster-recheck-interval (15 minutes by default), a failure-timeout of 60s can still take up to about 15 minutes to take effect, which matches the delay in the log above. Set failure-timeout=0 to disable automatic expiry. See the example commands below.
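
For example, using the low-level Pacemaker tools (the 2min value is only illustrative; pick an interval that suits your cluster):

# Re-run the policy engine more often than the 15-minute default,
# so expired failcounts are cleared closer to failure-timeout
crm_attribute --type crm_config --name cluster-recheck-interval --update 2min

# Or clear the resource's failcount by hand, immediately
crm_resource --cleanup --resource named

Note that lowering cluster-recheck-interval makes the whole cluster re-evaluate its state more frequently, so avoid setting it aggressively low on large clusters.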