
How do I use Nagios to monitor a log file


We are using Nagios to monitor our network with great success. However, we also have a syslog file for critical application errors, and although I set up check_log, it doesn't seem to work as well as monitoring a device.

The issues are:

  • It only shows the last entry
  • There doesn't seem to be a way to acknowledge the critical error and return the monitor to a good state

Is Nagios the wrong tool, or are we just not setting up the service monitoring correctly?

Here are my entries:

# log file
define command{
        command_name    check_log
        command_line    $USER1$/check_log -F /var/log/applications/appcrit.log -O /tmp/appcrit.log -q ?
}


# Define the log monitoring service
define service{
        name                            logfile-check           ;
        use                             generic-service         ;
        check_period                    24x7                    ;
        max_check_attempts              1                       ;
        normal_check_interval           5                       ;
        retry_check_interval            1                       ;
        contact_groups                  admins                  ;
        notification_options            w,u,c,r                 ;
        notification_period             24x7                    ;
        register                        0                       ;
        }

define service{
        use                             logfile-check
        host_name                       localhost
        service_description             CritLogFile
        check_command                   check_log
}

Solution

  • There are many ways to achieve this. One option is the check_logfiles plugin from ConSol: https://labs.consol.de/lang/en/nagios/check_logfiles/

    • supports regex
    • supports log rotation

    To use it, you need a cfg file. This is an example for Oracle databases:

    @searches = ({
      tag => 'oraalerts',
      options => 'sticky=28800',
      logfile => '/u01/app/oracle/diag/rdbms/davmdkp/DAVMDKP1/trace/alert_DAVMDKP1.log',
      criticalpatterns => [
          'ORA\-0*204[^\d]',        # error in reading control file
          'ORA\-0*206[^\d]',        # error in writing control file
          'ORA\-0*210[^\d]',        # cannot open control file
          'ORA\-0*257[^\d]',        # archiver is stuck
          'ORA\-0*333[^\d]',        # redo log read error
          'ORA\-0*345[^\d]',        # redo log write error
          'ORA\-0*4[4-7][0-9][^\d]',# ORA-0440 - ORA-0485 background process failure
          'ORA\-0*48[0-5][^\d]',
          'ORA\-0*6[0-3][0-9][^\d]',# ORA-0600 - ORA-0639 internal errors
          'ORA\-0*1114[^\d]',        # datafile I/O write error
          'ORA\-0*1115[^\d]',        # datafile I/O read error
          'ORA\-0*1116[^\d]',        # cannot open datafile
          'ORA\-0*1118[^\d]',        # cannot add a data file
          'ORA\-0*1122[^\d]',       # database file 16 failed verification check
          'ORA\-0*1171[^\d]',       # datafile 16 going offline due to error advancing checkpoint
          'ORA\-0*1201[^\d]',       # file 16 header failed to write correctly
          'ORA\-0*1208[^\d]',       # data file is an old version - not accessing current version
          'ORA\-0*1578[^\d]',        # data block corruption
          'ORA\-0*1135[^\d]',        # file accessed for query is offline
          'ORA\-0*1547[^\d]',        # tablespace is full
          'ORA\-0*1555[^\d]',        # snapshot too old
          'ORA\-0*1562[^\d]',        # failed to extend rollback segment
          'ORA\-0*162[89][^\d]',     # ORA-1628 - ORA-1632 maximum extents exceeded
          'ORA\-0*163[0-2][^\d]',
          'ORA\-0*165[0-6][^\d]',    # ORA-1650 - ORA-1656 tablespace is full
          'ORA\-16014[^\d]',      # log cannot be archived, no available destinations
          'ORA\-16038[^\d]',      # log cannot be archived
          'ORA\-19502[^\d]',      # write error on datafile
          'ORA\-27063[^\d]',         # number of bytes read/written is incorrect
          'ORA\-0*4031[^\d]',        # out of shared memory.
          'No space left on device',
          'Archival Error',
      ],
      warningpatterns => [
          'ORA\-0*3113[^\d]',        # end of file on communication channel
          'ORA\-0*6501[^\d]',         # PL/SQL internal error
          'ORA\-0*1140[^\d]',         # follows WARNING: datafile #20 was not in online backup mode
          'Archival stopped, error occurred. Will continue retrying',
      ]
    });
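
    To wire this into Nagios, a minimal sketch of the matching command and service definitions could look like the following. It assumes the plugin is installed in the $USER1$ plugin directory and that the cfg file above was saved as /etc/nagios/check_logfiles/oraalerts.cfg; that path and the service description OraAlertLog are hypothetical names chosen for illustration. The service reuses the logfile-check template from the question. As far as I understand the plugin's documentation, the sticky=28800 option in the cfg keeps a reported error state for up to 28800 seconds (8 hours) instead of clearing it on the next check, which is relevant to the question about returning the monitor to a good state.

    ```
    # check_logfiles command; the cfg path is an assumption, adjust it to your setup
    define command{
            command_name    check_logfiles
            command_line    $USER1$/check_logfiles --config /etc/nagios/check_logfiles/oraalerts.cfg
    }

    # Service using the logfile-check template defined in the question
    define service{
            use                             logfile-check
            host_name                       localhost
            service_description             OraAlertLog
            check_command                   check_logfiles
    }
    ```

    For the application log from the question, the same approach should work with a second cfg file whose logfile entry points at /var/log/applications/appcrit.log and whose criticalpatterns list holds the application's own error strings.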