perl snmp nagios nfs

Nagios SNMP Process check hangs on stale nfs mount


I was just given this assignment at work and it is WAY over my head. We have a Nagios monitoring script that runs a process check. One of our NFS servers has been having issues lately, and whenever it goes down, all of the machines that have it mounted start failing their process checks, because the NFS mount is hung and has in turn hung the SNMP check.

The check script is a Perl Nagios script that uses the Net::SNMP library. I'm pretty sure that it is just the generic Nagios script. The script is found at http://nagios.manubulon.com/check_snmp_process.pl

Please help me understand what is going on.

EDIT: The NFS mount in question is for Oracle RMAN backups, which require the mount to be hard.


Solution

  • Fairly simple - NFS is designed to tolerate server reboots. When a file system is mounted hard, NFS calls to it therefore block and wait for the server to respond. This ensures that no data is lost and no process fails with an error - processes simply 'stall' until the server comes back, which is the problem you're having.

    NFS has a mount option that avoids this problem - simply specify soft when mounting (either in the options field in fstab, or with -o soft when mounting manually).

    Be warned though - with a soft mount, operations on the NFS mount return errors instead of blocking when the server stops responding. Most things will tolerate this, but it's always possible that badly written scripts or programs will fall over.
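
As a rough illustration of the soft option described in the answer, the two ways of specifying it might look like the sketch below. The server name nfsserver, the export /export/backups and the mount point /mnt/backups are hypothetical placeholders, not values from the question.

```shell
# Hypothetical server, export and mount point - substitute your own.

# Mounting manually with the soft option (needs root):
mount -t nfs -o soft nfsserver:/export/backups /mnt/backups

# Equivalent /etc/fstab entry (a single line; fourth field holds the options):
# nfsserver:/export/backups  /mnt/backups  nfs  soft  0  0
```

Note that the question's EDIT says the RMAN backup mount has to stay hard, so this only applies to mounts where a soft mount is acceptable.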