Tags: azure, logging, kql, azure-alerts

How do I create an alert which fires when one of many machines fails to report a heartbeat?


Overall, I'm trying to set up an Azure alert that emails me when a computer goes down, using the Heartbeat table.

Let's say I have 5 machines in my Azure subscription, and they each report once per minute to the table called Heartbeat, so it looks something like this:

Table Example

Currently, I can run "Heartbeat | where Computer == 'computer-name' | where TimeGenerated > ago(5m)" and, if it returns no rows, conclude that this one computer has not reported in the last 5 minutes and is down (thank you to this great article for that query).

I am not very experienced with any query language, so I am wondering whether a single query can check if ANY computer stopped sending its logs over the last 5-10 minutes, and thus would be down. Azure uses KQL (Kusto Query Language) for its queries, and there is documentation in the link above.

Thanks for the help


Solution

  • One option is to calculate the latest report time for each computer, then keep only the computers whose latest report is more than 5 minutes old:

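    // How far back to look when building the list of known computers
    let all_computers_lookback = 12h
    ;
    // A computer with no heartbeat for this long is considered down
    let monitor_lookback = 5m
    ;
    Heartbeat
    | where TimeGenerated > ago(all_computers_lookback)
    // Column names are case-sensitive; the Heartbeat column is Computer
    | summarize last_reported = max(TimeGenerated) by Computer
    | where last_reported < ago(monitor_lookback)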
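
    To try the logic above without waiting for a real machine to go down, here is a minimal sketch that runs the same summarize-and-filter steps against made-up data. The SampleHeartbeat table, the minutes_ago column, and the vm-a/vm-b/vm-c names are hypothetical and exist only for this illustration; they are not part of the real Heartbeat schema.

    ```kql
    // Hypothetical stand-in for the Heartbeat table (illustration only).
    let monitor_lookback = 5m;
    let SampleHeartbeat =
        datatable(Computer: string, minutes_ago: int)
        [
            "vm-a", 1,
            "vm-a", 2,
            "vm-b", 1,
            "vm-c", 7,
            "vm-c", 8
        ]
        // Turn the relative offsets into timestamps relative to query time
        | extend TimeGenerated = now() - 1m * minutes_ago
        | project Computer, TimeGenerated;
    SampleHeartbeat
    // Latest heartbeat per computer
    | summarize last_reported = max(TimeGenerated) by Computer
    // Keep only computers whose latest heartbeat is older than the threshold
    | where last_reported < ago(monitor_lookback)
    ```

    With this sample data, only vm-c should come back, since its most recent heartbeat is 7 minutes old while vm-a and vm-b both reported 1 minute ago.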
    

    Another alternative:

    • The first part creates an "inventory" of all computers that reported at least once in a longer window (e.g. the last 12 hours).
    • The second part finds all computers that reported at least once in the monitoring window (e.g. the last 5 minutes).
    • The third and final part takes the difference between the two (i.e. all computers that didn't report in the last 5 minutes).

    Note: the in() operator accepts at most 1,000,000 values, so if you have more than 1M computers, use the join operator instead of in().

    let all_computers_lookback = 12h
    ;
    let monitor_lookback = 5m
    ;
    // Inventory: every computer seen in the longer window
    let all_computers = 
        Heartbeat
        | where TimeGenerated > ago(all_computers_lookback)
        | distinct Computer
    ;
    // Computers seen within the monitoring window
    let reporting_computers = 
        Heartbeat
        | where TimeGenerated > ago(monitor_lookback)
        | distinct Computer
    ;
    // Difference: known computers that have gone quiet
    let non_reporting_computers =
        all_computers
        | where Computer !in(reporting_computers)
    ;
    non_reporting_computers
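
    For the join-based variant mentioned in the note, a minimal sketch could look like the following. It assumes the same two lookback windows as above and uses a leftanti join, which keeps the rows from the left side that have no match on the right side. Treat it as one possible way to write it, not the only one.

    ```kql
    let all_computers_lookback = 12h;
    let monitor_lookback = 5m;
    // Left side: every computer seen in the longer window
    Heartbeat
    | where TimeGenerated > ago(all_computers_lookback)
    | distinct Computer
    // leftanti keeps left-side rows with no matching Computer on the right
    | join kind=leftanti (
        // Right side: computers seen within the monitoring window
        Heartbeat
        | where TimeGenerated > ago(monitor_lookback)
        | distinct Computer
    ) on Computer
    ```

    The result should be the same set of computers as the !in() version; the join form simply avoids passing the list of reporting computers as an in() argument.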