bash, event-handling, nagios

Nagios event handler script in Bash to restart services one by one, restarting the next only after the previous one has started


Hi Stack Overflow community,

I need help with a Bash script, since I am new to it. Here is what I am trying to accomplish: we have a Windows server that sometimes hits 90% memory usage. Whenever Nagios catches this, we want to restart these services via NRPE. The services must not all be restarted at once, though: the first service has to come up, and only once it is up should the next service be restarted.

Another option is to stop all 4 services and then start them sequentially.

Here is the script that I wrote:

case "$1" in
OK)
;;
WARNING)
;;
UNKNOWN)
;;
CRITICAL) ## DECISION ENGINE RESTART
echo -n "Restarting Decision Engine_1"
cat /usr/local/nagios/libexec/mail/DeServiceRestart.txt | mail -s "Restarting DE services" [email protected] -r Nagios@ATL-NM-01
/usr/local/nagios/libexec/check_nrpe -H "$2" -p 5666 -c restart_service -a DecisionEngine_1;
if /usr/local/nagios/libexec/check_nrpe -H "$2" -t 30 -c check_service -a DecisionEngine_1 'crit=not state_is_ok()' > OK:
then
echo -n "Restarting Decision Engine_2"
/usr/local/nagios/libexec/check_nrpe -H "$2" -p 5666 -c restart_service -a DecisionEngine_2
if /usr/local/nagios/libexec/check_nrpe -H "$2" -t 30 -c check_service -a DecisionEngine_2 'crit=not state_is_ok()' > OK:
then
echo -n "Restarting Decision Engine_3"
/usr/local/nagios/libexec/check_nrpe -H "$2" -p 5666 -c restart_service -a DecisionEngine_3
if /usr/local/nagios/libexec/check_nrpe -H "$2" -t 30 -c check_service -a DecisionEngine_3 'crit=not state_is_ok()' > OK:
then
echo -n "Restarting Decision Engine_4"
/usr/local/nagios/libexec/check_nrpe -H "$2" -p 5666 -c restart_service -a DecisionEngine_4
else
   echo " Restart is complete"
fi
;;
esac
exit 0

I am not sure where I made a mistake and would appreciate any feedback.

Thanks!


Solution

  • All explanations are in the code comments. Double-check the StopService function: you did not mention how the services are stopped, so I modeled it on the restart command.

    #!/bin/bash
    
    SERVICESTATE=$1;      #Service state (OK, WARNING, CRITICAL or UNKNOWN)
    Host=$2;              #Hostname or IP address
    SERVICESTATETYPE=$3;  #State type (HARD or SOFT)
    
    TimeOut=3;            #Time (seconds) to wait for a service to start/stop
                          #before processing the next service.
                          #The timeout cannot be infinite, because Nagios
                          #kills an event handler that runs too long.
    
    
    #Array of service names, in the order they are processed
    Services=(DecisionEngine_1 DecisionEngine_2 DecisionEngine_3 DecisionEngine_4)
    
    #Add the Nagios plugins directory to PATH
    PATH=$PATH:/usr/local/nagios/libexec
    
    RestartService() {
       #function restarts services via NRPE.
       #Usage:  RestartService ServiceName
       echo -n " Restarting $1;"
       check_nrpe -H "${Host}" -p 5666 -c restart_service -a "$1" >/dev/null 2>&1
       return $?
    }
    
    StopService() {
       #function stops services via NRPE.
       #Usage: StopService ServiceName
       echo -n " Stopping $1;"
       check_nrpe -H "${Host}" -p 5666 -c stop_service -a "$1" >/dev/null 2>&1
       return $?
    }
    
    ServiceWait() {
       #Repeatedly checks a service via NRPE until the check reaches the
       #expected result or TimeOut expires.
       #Usage:  ServiceWait ServiceName {start|stop}
       #start waits for a successful check
       #stop waits for a failed check
       Logic="";
       [ "$2" == "start" ] && Logic="-eq"; #RC for start check should be 0
       [ "$2" == "stop" ] && Logic="-ne" ; #RC for stop check should NOT be 0
       [ -z "$Logic" ] && { echo "ServiceWait function usage error"; exit 19; }
       t=${TimeOut}
       while [ "$t" -ge 0 ]; do
          check_nrpe -H "${Host}" -p 5666 -t 30 \
                     -c check_service -a "$1" 'crit=not state_is_ok()' >/dev/null 2>&1
          RC=$?
          [ "$RC" $Logic 0 ] && { echo -n "CheckRC=$RC;"; return $RC; }      
                                  #expected state reached, no need to wait anymore
          let t--
          sleep 1
       done
       echo -n "TimeOut; " 
       return 3
    }
    
    #Exit with a usage message if $1, $2 or $3 is empty
    [ -z "${SERVICESTATE}" -o -z "${Host}" -o -z "${SERVICESTATETYPE}" ] && { 
        echo "Usage: $0 {OK|WARNING|UNKNOWN|CRITICAL} Hostname {SOFT|HARD}"; 
        exit 1; 
      }
    
    case "${SERVICESTATE}" in
       OK)
       ;;
       WARNING)
       ;;
       UNKNOWN)
       ;;
       CRITICAL) ## DECISION ENGINE RESTART
         #Uncomment if you need the e-mail notification
         #cat /usr/local/nagios/libexec/mail/DeServiceRestart.txt | \
         # mail -s "Restarting DE services" [email protected] -r Nagios@ATL-NM-01
         RC=0
    
         if [ "$SERVICESTATETYPE" == "SOFT" ] ; then
            for (( i=0; i<${#Services[*]}; i++ )); do
               RestartService ${Services[$i]}
               ServiceWait ${Services[$i]} start
               RC=$?
               #If the check failed, do not attempt any further restarts
               [ "$RC" -ne 0 ] && break;         
               SuccessRestart+=(${Services[$i]})
            done
            echo "Restart is complete. ${SuccessRestart[*]} Return Code is ${RC}"
         elif [ "$SERVICESTATETYPE" == "HARD" ] ; then
            #Stop all services sequentially.
            for (( i=0; i<${#Services[*]}; i++ )); do
               StopService ${Services[$i]}
               #Experiment with what to wait for here:
               #it may be better to simply sleep for N seconds while the
               #service is stopping rather than to poll the service state
               ServiceWait ${Services[$i]} stop
               #sleep $TimeOut
            done
            #Start all services sequentially.
            for (( i=0; i<${#Services[*]}; i++ )); do
               RestartService ${Services[$i]}
               ServiceWait ${Services[$i]} start
               RC=$?
               #If the check failed, do not attempt any further restarts
               [ "$RC" -ne 0 ] && break;      
               SuccessRestart+=(${Services[$i]})
            done
         else
             echo "Unknown SERVICESTATETYPE $SERVICESTATETYPE option" 
             exit 20
         fi
       ;;
    esac
    exit 0
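
A note on why the script in the question does not work as intended: in "if ... > OK:" the "> OK:" part is a redirection that writes the plugin output to a file named "OK:"; it compares nothing. "if" branches only on the exit status of the command, and Nagios plugins such as check_nrpe exit with 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. The question's script also opens three "if" blocks but closes only one with "fi". Below is a minimal sketch of the exit-status test combined with a polling loop. The names fake_check and wait_until_up are hypothetical, and fake_check is only a stand-in for the real check_nrpe call so that the example runs without a Nagios host.

```shell
# Hypothetical stand-in for check_nrpe (an assumption for this sketch):
# it reports CRITICAL (exit 2) on the first two calls and OK (exit 0) afterwards.
calls=0
fake_check() {
   calls=$((calls + 1))
   if [ "$calls" -ge 3 ]; then
      echo "OK: service is running"
      return 0
   fi
   echo "CRITICAL: service is stopped"
   return 2
}

# Poll until the check succeeds or the attempts run out.
# Usage: wait_until_up MaxAttempts SleepSeconds
wait_until_up() {
   attempts_left=$1
   while [ "$attempts_left" -gt 0 ]; do
      # "if" looks only at the exit status; the text output is discarded.
      if fake_check >/dev/null 2>&1; then
         return 0
      fi
      attempts_left=$((attempts_left - 1))
      sleep "$2"
   done
   return 1
}

if wait_until_up 5 0; then
   echo "up after $calls checks"
else
   echo "still down after $calls checks"
fi
```

Run as written, this prints "up after 3 checks". In a real handler the fake_check call would be replaced by the check_nrpe command from the script above, with a non-zero sleep interval.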