Splunk: How to Compute Incident Duration Records?


I have the following events in Splunk:

_time                           Agent_Hostname      alarm               status
2020-08-23T03:04:05.000-0700    m50-ups.a_domain    upsAlarmOnBypass    raised
2020-08-23T03:07:16.000-0700    m50-ups.a_domain    upsTrapOnBattery    raised
2020-08-23T03:07:16.000-0700    m50-ups.a_domain    upsAlarmInputBad    raised
2020-08-23T03:07:39.000-0700    m50-ups.a_domain    upsAlarmOnBypass    raised
2020-08-23T03:07:39.000-0700    m50-ups.a_domain    upsAlarmLowBattery  raised
2020-08-23T03:08:17.000-0700    m50-ups.a_domain    upsTrapOnBattery    raised
2020-08-23T03:09:24.000-0700    m50-ups.a_domain    upsTrapOnBattery    raised
2020-08-23T03:10:31.000-0700    m50-ups.a_domain    upsAlarmOnBattery   cleared
2020-08-23T03:10:32.000-0700    m50-ups.a_domain    upsAlarmInputBad    cleared
2020-08-23T03:11:12.000-0700    m50-ups.a_domain    upsAlarmLowBattery  cleared
2020-08-23T03:19:06.000-0700    m50-ups.a_domain    upsAlarmInputBad    raised
2020-08-23T03:19:06.000-0700    m50-ups.a_domain    upsTrapOnBattery    raised
2020-08-23T03:19:13.000-0700    m50-ups.a_domain    upsAlarmLowBattery  raised
2020-08-23T03:20:10.000-0700    m50-ups.a_domain    upsTrapOnBattery    raised
2020-08-23T03:21:16.000-0700    m50-ups.a_domain    upsTrapOnBattery    raised
2020-08-23T03:22:22.000-0700    m50-ups.a_domain    upsTrapOnBattery    raised
2020-08-23T03:23:29.000-0700    m50-ups.a_domain    upsTrapOnBattery    raised
2020-08-23T03:24:28.000-0700    m50-ups.a_domain    upsAlarmInputBad    cleared
2020-08-23T03:24:28.000-0700    m50-ups.a_domain    upsAlarmOnBattery   cleared
2020-08-23T03:25:09.000-0700    m50-ups.a_domain    upsAlarmLowBattery  cleared
2020-08-23T03:25:58.000-0700    m50-ups.a_domain    upsAlarmOnBypass    cleared

My first problem is how to compute incident duration records for each host and each alarm type. For example, from the events above I'd like to obtain the following with an algorithm, not by hard-coding the values from this particular example:

start                        end                          Agent_Hostname   alarm
2020-08-23T03:04:05.000-0700 2020-08-23T03:25:58.000-0700 m50-ups.a_domain upsAlarmOnBypass
2020-08-23T03:07:16.000-0700                              m50-ups.a_domain upsTrapOnBattery
2020-08-23T03:07:16.000-0700 2020-08-23T03:24:28.000-0700 m50-ups.a_domain upsAlarmInputBad
2020-08-23T03:07:39.000-0700 2020-08-23T03:25:09.000-0700 m50-ups.a_domain upsAlarmLowBattery

where start is the earliest time when an alarm for a host is first raised, and end is the time when the same alarm/host is cleared.

My second problem is how to find the longest duration among the completed spans, ignoring those without an end time.

My question is how I can achieve this within Splunk.


Solution

  • The transaction command can handle most of that. The only thing I can't get it to do is display outstanding alarms.

    | makeresults 
    | eval _raw="time                            Agent_Hostname      alarm               status
    2020-08-23T03:04:05.000-0700    m50-ups.a_domain    upsAlarmOnBypass    raised
    2020-08-23T03:07:16.000-0700    m50-ups.a_domain    upsTrapOnBattery    raised
    2020-08-23T03:07:16.000-0700    m50-ups.a_domain    upsAlarmInputBad    raised
    2020-08-23T03:07:39.000-0700    m50-ups.a_domain    upsAlarmOnBypass    raised
    2020-08-23T03:07:39.000-0700    m50-ups.a_domain    upsAlarmLowBattery  raised
    2020-08-23T03:08:17.000-0700    m50-ups.a_domain    upsTrapOnBattery    raised
    2020-08-23T03:09:24.000-0700    m50-ups.a_domain    upsTrapOnBattery    raised
    2020-08-23T03:10:31.000-0700    m50-ups.a_domain    upsAlarmOnBattery   cleared
    2020-08-23T03:10:32.000-0700    m50-ups.a_domain    upsAlarmInputBad    cleared
    2020-08-23T03:11:12.000-0700    m50-ups.a_domain    upsAlarmLowBattery  cleared
    2020-08-23T03:19:06.000-0700    m50-ups.a_domain    upsAlarmInputBad    raised
    2020-08-23T03:19:06.000-0700    m50-ups.a_domain    upsTrapOnBattery    raised
    2020-08-23T03:19:13.000-0700    m50-ups.a_domain    upsAlarmLowBattery  raised
    2020-08-23T03:20:10.000-0700    m50-ups.a_domain    upsTrapOnBattery    raised
    2020-08-23T03:21:16.000-0700    m50-ups.a_domain    upsTrapOnBattery    raised
    2020-08-23T03:22:22.000-0700    m50-ups.a_domain    upsTrapOnBattery    raised
    2020-08-23T03:23:29.000-0700    m50-ups.a_domain    upsTrapOnBattery    raised
    2020-08-23T03:24:28.000-0700    m50-ups.a_domain    upsAlarmInputBad    cleared
    2020-08-23T03:24:28.000-0700    m50-ups.a_domain    upsAlarmOnBattery   cleared
    2020-08-23T03:25:09.000-0700    m50-ups.a_domain    upsAlarmLowBattery  cleared
    2020-08-23T03:25:58.000-0700    m50-ups.a_domain    upsAlarmOnBypass    cleared" 
    | multikv forceheader=1 
    | eval _time=strptime(time,"%Y-%m-%dT%H:%M:%S.%3N%z")
    | fields _time Agent_Hostname alarm status 
    ```Everything above just defines test data - Remove Before Flight```
    ```Omit the reverse command if events are in descending order (the default)```
    | reverse
    ```Set the start and end times based on status```
    | eval start=if(status="raised",_time, null()), end=if(status="cleared",_time, null())
    ```Define transactions based on "raised/cleared" pairs within host and alarm names```
    | transaction Agent_Hostname alarm startswith="raised" endswith="cleared"
    ```Change duration display to hh:mm:ss```
    | fieldformat duration=tostring(duration,"duration")
    | table start end Agent_Hostname alarm duration
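
To make the pairing logic concrete outside Splunk, here is a minimal Python sketch of the records described in the question: for each host/alarm pair, start is the earliest "raised" time and end is the latest "cleared" time, and the longest span is taken only over records that have both. The function names `incident_records` and `longest_closed_incident` are hypothetical, and the sketch assumes that repeated raise/clear cycles of the same alarm collapse into one record, as in the question's expected table. That differs from the transaction search above, which pairs "raised" and "cleared" events.

```python
from datetime import datetime

# Timestamp layout used by the sample events, e.g. 2020-08-23T03:04:05.000-0700
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def incident_records(events):
    """Collapse (time, host, alarm, status) events into one record per (host, alarm).

    start = earliest "raised" time, end = latest "cleared" time; either may be None.
    """
    records = {}
    for time_str, host, alarm, status in events:
        ts = datetime.strptime(time_str, TIME_FORMAT)
        rec = records.setdefault((host, alarm), {"start": None, "end": None})
        if status == "raised" and (rec["start"] is None or ts < rec["start"]):
            rec["start"] = ts
        elif status == "cleared" and (rec["end"] is None or ts > rec["end"]):
            rec["end"] = ts
    return records


def longest_closed_incident(records):
    """Return ((host, alarm), seconds) for the longest record with both start and end, or None."""
    closed = {
        key: (rec["end"] - rec["start"]).total_seconds()
        for key, rec in records.items()
        if rec["start"] is not None and rec["end"] is not None
    }
    if not closed:
        return None
    key = max(closed, key=closed.get)
    return key, closed[key]


# A few of the events from the question: one open incident and two closed ones.
sample_events = [
    ("2020-08-23T03:04:05.000-0700", "m50-ups.a_domain", "upsAlarmOnBypass", "raised"),
    ("2020-08-23T03:07:16.000-0700", "m50-ups.a_domain", "upsTrapOnBattery", "raised"),
    ("2020-08-23T03:07:16.000-0700", "m50-ups.a_domain", "upsAlarmInputBad", "raised"),
    ("2020-08-23T03:24:28.000-0700", "m50-ups.a_domain", "upsAlarmInputBad", "cleared"),
    ("2020-08-23T03:25:58.000-0700", "m50-ups.a_domain", "upsAlarmOnBypass", "cleared"),
]

records = incident_records(sample_events)
print(longest_closed_incident(records))
# (('m50-ups.a_domain', 'upsAlarmOnBypass'), 1313.0)
```

Records whose end is None (here upsTrapOnBattery) are the outstanding alarms. They stay in `records` but are skipped when looking for the longest span.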