In our Pacemaker + Corosync cluster:
Last updated: Thu Oct 22 21:16:33 2015
Last change: Thu Oct 22 17:25:13 2015 via cibadmin on aws015
Stack: corosync
Current DC: aws015 (2887647247) - partition with quorum
Version: 1.1.10-42f2063
4 Nodes configured
16 Resources configured
We have the following situation. We wrote a Python LSB script that checks the status of an application, and configured it as a resource:
primitive pm2_app_gardenscapesDynamo_lsb lsb:pm2_app_gardenscapesDynamo \
op start interval="0" timeout="60s" \
op stop interval="0" timeout="60s" \
op monitor interval="30s" timeout="60s" on-fail="restart" \
meta failure-timeout="10s" migration-threshold="1"
The check is performed by a utility that can hang (the LSB script launches the utility and waits for its reply). When Pacemaker reaches the timeout, it kills our Python script, but the hung utility stays in memory and never dies.
Is it possible to prevent this situation?
You need to upgrade to Pacemaker 1.1.12 or later.
This happens because Pacemaker starts each resource agent (RA) in its own process group. When an operation times out, Pacemaker 1.1.10 kills only the RA process itself, so any child processes the RA started are left behind as orphans.
Version 1.1.12 instead kills the entire process group.
The relevant code is in lib/common/mainloop.c, in the function child_kill_helper.
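If upgrading is not possible right away, the same idea (kill the whole process group, not just one process) can be applied inside the LSB script itself. The following is only a minimal sketch under some assumptions: it requires Python 3, the helper name run_with_timeout is hypothetical, and the script's own timeout has to be shorter than the 60s monitor timeout, because a script that Pacemaker has already killed gets no chance to clean up.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd in its own session/process group (hypothetical helper).

    Returns the command's exit code, or None if it did not finish
    within timeout_s seconds and its process group was killed.
    """
    # start_new_session=True makes the child the leader of a new
    # process group, so its pid is also the group id.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        try:
            # Kill the utility and everything it spawned in that group.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited between the timeout and the kill
        proc.wait()  # reap the child so it does not linger as a zombie
        return None
```

In the status action, the script could call something like run_with_timeout([...], 45) and translate a None result into whichever LSB status code fits (for example 4, "status unknown"), so that Pacemaker sees a failed monitor instead of a timeout.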