cluster-computing pacemaker

Orphaned process when pacemaker kills the main monitor script (LSB) due to a timeout


In our pacemaker + corosync cluster:

Last updated: Thu Oct 22 21:16:33 2015
Last change: Thu Oct 22 17:25:13 2015 via cibadmin on aws015
Stack: corosync
Current DC: aws015 (2887647247) - partition with quorum
Version: 1.1.10-42f2063
4 Nodes configured
16 Resources configured

We have the following situation. We wrote a Python LSB script that checks the status of an application, and configured it as a resource:

primitive pm2_app_gardenscapesDynamo_lsb lsb:pm2_app_gardenscapesDynamo \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="60s" \
    op monitor interval="30s" timeout="60s" on-fail="restart" \
    meta failure-timeout="10s" migration-threshold="1"

The check is performed by a utility that can hang (the LSB script launches the utility and waits for its reply). When pacemaker reaches the monitor timeout, it kills our Python script, but the hung utility stays in memory and never exits.

Is it possible to prevent this situation?


Solution

  • You need to upgrade to pacemaker 1.1.12 or later.

    This happens because pacemaker starts each resource agent (RA) in its own process group. When an operation times out, pacemaker 1.1.10 kills only the RA process itself, so any child processes the RA started are left running as orphans.

    Version 1.1.12 instead kills the entire process group.

    The relevant code is in lib/common/mainloop.c, in the function child_kill_helper.
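If upgrading is not possible right away, the same idea (signal a whole process group rather than a single process) can be applied inside the monitor script itself. The following is only a minimal sketch, not the asker's script and not pacemaker's code: the function name run_check and its parameters are hypothetical, and it assumes a POSIX system and Python 3. It starts the utility in a process group of its own, waits with a timeout, and kills the whole group if the utility hangs.

```python
import os
import signal
import subprocess


def run_check(cmd, timeout):
    """Run cmd in its own process group; kill the whole group on timeout.

    Hypothetical helper for illustration. Returns the command's exit code,
    or None if it did not finish within `timeout` seconds.
    """
    # start_new_session=True makes the child the leader of a new session and
    # process group, so its process group ID equals its PID. Anything the
    # child spawns inherits that group unless it changes it explicitly.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Signal every process in the group, not just the direct child.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the child so it does not remain as a zombie
        return None
```

In this sketch the timeout passed to run_check would have to be shorter than the monitor operation's timeout (60s in the configuration above), so that the script cleans up before pacemaker kills it. Note the trade-off: because the utility runs in a separate process group, a group-wide kill aimed at the script's own group would not reach it, so the script's internal timeout becomes the only cleanup path.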