
Having one Azure Monitor Alert Rule for all severities


We use Azure Monitor to flag issues in our environments, such as VMs with disks nearing capacity. Typically we have 3 different thresholds with related severities; e.g.

  • Critical: (free disk space) <= 1%
  • Error: 1% < (free disk space) <= 5%
  • Warning: 5% < (free disk space) <= 10%

We define these alerts in Terraform, so the implementation for one of these looks something like this (I've removed a bunch of bits to give the gist):

resource "azurerm_monitor_scheduled_query_rules_alert_v2" "azure_tf_monitor_alert" {
  name                = "VM Free Disk Space - ERROR"
  severity            = 1 # error
  criteria {
    threshold               = 0
    operator                = "GreaterThan"
    query                   = <<-QUERY
      Perf
      | where TimeGenerated > ago(1h)
      | where ObjectName in~ ('LogicalDisk', 'Logical Disk') and CounterName =~ '% Free Space'  
      | where not (InstanceName in~ ("_Total" , "total" )) and not (InstanceName startswith "/snap")
      | summarize arg_max(TimeGenerated, CounterValue) by Computer, InstanceName, _ResourceId
      | project Computer, InstanceName, _ResourceId, CounterValue, TimeGenerated 
      | where CounterValue > 1 and CounterValue <= 5
      QUERY
      # etc...
  }
  # etc...
}

We then have almost the same rule 3 times; the only differences being:

name ending | severity | last line of query
WARNING     | 2        | where CounterValue > 5 and CounterValue <= 10
ERROR       | 1        | where CounterValue > 1 and CounterValue <= 5
CRITICAL    | 0        | where CounterValue <= 1

This feels inefficient. We could easily have 1 rule called "VM Free Disk Space" and amend the KQL query to return a value for Severity by having the last lines of the query be something like this:

| where CounterValue <= 10         // ignore the healthy stuff
| extend Severity = case (
          CounterValue <= 1,  0,   // Critical
          CounterValue <= 5,  1,   // Error
          2                        // Warning by default
  )

However, so far as I can tell there's no way to take the severity value produced by the query and apply it to the rule's severity setting.

The Terraform resource does allow us to set a threshold. I currently have this set to 0, meaning "if we get any matches, go to the fired state", but we could adjust the query to return the CounterValue directly and then have different criteria blocks with different thresholds. However, from what I can tell we can only have 1 criteria block; besides, severity isn't set within that block but in the parent resource.

Obviously we can put this Terraform in a module and create 3 rules from a single definition (in fact, we have), but this still feels inefficient: it suggests to me that MS is running essentially the same query 3 times just to handle the 3 scenarios.

Is there a better approach that we're missing?


Solution

  • Severity is a fixed property of the alert rule and cannot be taken from the query output, so if you need several severity levels you need to create one alert rule per level.

    In order to reduce code duplication, you can use a module (as you already mentioned) or, as an alternative, a template file for the queries like the following:

    my-module/templates/alert-rule.tpl

    Perf
    | where TimeGenerated > ago(1h)
    | where ObjectName in~ ('LogicalDisk', 'Logical Disk') and CounterName =~ '% Free Space'  
    | where not (InstanceName in~ ("_Total" , "total" )) and not (InstanceName startswith "/snap")
    | summarize arg_max(TimeGenerated, CounterValue) by Computer, InstanceName, _ResourceId
    | project Computer, InstanceName, _ResourceId, CounterValue, TimeGenerated 
    | where ${condition}
    

    Only the last line varies between rules: ${condition} is substituted when the template is rendered.

    The example below shows how to use a map with one entry per alert rule/severity. I'm using a null_resource for the sake of simplicity, but it should be easy enough to adapt this code to your case:

    main.tf

    locals {
      alert_rules = {
        # "condition" holds only the predicate; the template already supplies "| where".
        "warning" = {
          name      = "VM Free Disk Space - WARNING"
          severity  = 2
          condition = "CounterValue > 5 and CounterValue <= 10"
        },
        "error" = {
          name      = "VM Free Disk Space - ERROR"
          severity  = 1
          condition = "CounterValue > 1 and CounterValue <= 5"
        },
        "critical" = {
          name      = "VM Free Disk Space - CRITICAL"
          severity  = 0
          condition = "CounterValue <= 1"
        }
      }
    }

    resource "null_resource" "alert_rules" {
      for_each = local.alert_rules

      triggers = {
        name      = each.value.name
        severity  = each.value.severity
        condition = each.value.condition

        query = templatefile("${path.module}/templates/alert-rule.tpl", {
          condition = each.value.condition
        })
      }
    }


    Running terraform plan:

    Terraform used the selected providers to generate the following execution
    plan. Resource actions are indicated with the following symbols:
      + create

    Terraform will perform the following actions:

      # null_resource.alert_rules["critical"] will be created
      + resource "null_resource" "alert_rules" {
          + id       = (known after apply)
          + triggers = {
              + "condition" = "CounterValue <= 1"
              + "name"      = "VM Free Disk Space - CRITICAL"
              + "query"     = <<-EOT
                    Perf
                    | where TimeGenerated > ago(1h)
                    | where ObjectName in~ ('LogicalDisk', 'Logical Disk') and CounterName =~ '% Free Space'
                    | where not (InstanceName in~ ("_Total" , "total" )) and not (InstanceName startswith "/snap")
                    | summarize arg_max(TimeGenerated, CounterValue) by Computer, InstanceName, _ResourceId
                    | project Computer, InstanceName, _ResourceId, CounterValue, TimeGenerated
                    | where CounterValue <= 1
                EOT
              + "severity"  = "0"
            }
        }

      # null_resource.alert_rules["error"] will be created
      + resource "null_resource" "alert_rules" {
          + id       = (known after apply)
          + triggers = {
              + "condition" = "CounterValue > 1 and CounterValue <= 5"
              + "name"      = "VM Free Disk Space - ERROR"
              + "query"     = <<-EOT
                    Perf
                    | where TimeGenerated > ago(1h)
                    | where ObjectName in~ ('LogicalDisk', 'Logical Disk') and CounterName =~ '% Free Space'
                    | where not (InstanceName in~ ("_Total" , "total" )) and not (InstanceName startswith "/snap")
                    | summarize arg_max(TimeGenerated, CounterValue) by Computer, InstanceName, _ResourceId
                    | project Computer, InstanceName, _ResourceId, CounterValue, TimeGenerated
                    | where CounterValue > 1 and CounterValue <= 5
                EOT
              + "severity"  = "1"
            }
        }

      # null_resource.alert_rules["warning"] will be created
      + resource "null_resource" "alert_rules" {
          + id       = (known after apply)
          + triggers = {
              + "condition" = "CounterValue > 5 and CounterValue <= 10"
              + "name"      = "VM Free Disk Space - WARNING"
              + "query"     = <<-EOT
                    Perf
                    | where TimeGenerated > ago(1h)
                    | where ObjectName in~ ('LogicalDisk', 'Logical Disk') and CounterName =~ '% Free Space'
                    | where not (InstanceName in~ ("_Total" , "total" )) and not (InstanceName startswith "/snap")
                    | summarize arg_max(TimeGenerated, CounterValue) by Computer, InstanceName, _ResourceId
                    | project Computer, InstanceName, _ResourceId, CounterValue, TimeGenerated
                    | where CounterValue > 5 and CounterValue <= 10
                EOT
              + "severity"  = "2"
            }
        }
    
    Plan: 3 to add, 0 to change, 0 to destroy.
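
    To adapt the null_resource example to the real resource from the question, something along these lines might work. This is a minimal sketch, not a tested configuration: the three variables, the resource label vm_free_disk_space, and the evaluation_frequency / window_duration / time_aggregation_method values are assumptions of mine and not taken from the question. It also assumes each condition in local.alert_rules contains only the predicate (no leading "where"), because the template already ends in "| where ${condition}".

    ```hcl
    # Hypothetical inputs; replace with however your module receives these values.
    variable "resource_group_name" { type = string }
    variable "location" { type = string }
    variable "log_analytics_workspace_id" { type = string }

    resource "azurerm_monitor_scheduled_query_rules_alert_v2" "vm_free_disk_space" {
      # One alert rule per entry in the map (warning, error, critical).
      for_each = local.alert_rules

      name                = each.value.name
      resource_group_name = var.resource_group_name
      location            = var.location
      scopes              = [var.log_analytics_workspace_id]
      severity            = each.value.severity

      # Assumed schedule: evaluate every 15 minutes over the last hour,
      # matching the ago(1h) filter in the query template.
      evaluation_frequency = "PT15M"
      window_duration      = "PT1H"

      criteria {
        # Render the shared KQL template with this rule's condition.
        # Each condition must be the predicate only, e.g. "CounterValue <= 1".
        query = templatefile("${path.module}/templates/alert-rule.tpl", {
          condition = each.value.condition
        })

        # Fire when the query returns at least one row.
        time_aggregation_method = "Count"
        operator                = "GreaterThan"
        threshold               = 0
      }
    }
    ```

    Each rule is still evaluated separately on the Azure side; the map and template only remove the duplication from the Terraform code.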