We use Azure Monitor to flag issues in our environments, such as VMs with disks nearing capacity. Typically we have 3 different thresholds with related severities; e.g.
(free disk space) <= 1%
1% < (free disk space) <= 5%
5% < (free disk space) <= 10%
We define these alerts in Terraform, so the implementation for one of these looks something like this (I've removed a bunch of bits to give the gist):
resource "azurerm_monitor_scheduled_query_rules_alert_v2" "azure_tf_monitor_alert" {
  name     = "VM Free Disk Space - ERROR"
  severity = 1 # error
  criteria {
    threshold = 0
    operator  = "GreaterThan"
    query     = <<-QUERY
      Perf
      | where TimeGenerated > ago(1h)
      | where ObjectName in~ ('LogicalDisk', 'Logical Disk') and CounterName =~ '% Free Space'
      | where not (InstanceName in~ ("_Total" , "total" )) and not (InstanceName startswith "/snap")
      | summarize arg_max(TimeGenerated, CounterValue) by Computer, InstanceName, _ResourceId
      | project Computer, InstanceName, _ResourceId, CounterValue, TimeGenerated
      | where CounterValue > 1 and CounterValue <= 5
      QUERY
    # etc...
  }
  # etc...
}
We then have almost the same rule 3 times; the only differences being:
name ending | severity | last line of query |
---|---|---|
WARNING | 2 | where CounterValue > 5 and CounterValue <= 10 |
ERROR | 1 | where CounterValue > 1 and CounterValue <= 5 |
CRITICAL | 0 | where CounterValue <= 1 |
This feels inefficient - we could easily have 1 rule called `VM Free Disk Space` and amend the KQL query to return a value for `Severity` by having the last lines of the query be something like this:
| where CounterValue <= 10 // ignore the healthy stuff
| extend Severity = case (
CounterValue <= 1, 0, // Critical
CounterValue <= 5, 1, // Error
2 // Warning by default
)
However, so far as I can tell there's no way to take the severity value produced by the query and apply it to the rule's `severity` setting.
The Terraform resource does allow us to set a `threshold`. I currently have this set to `0`, meaning that any match puts the rule into the fired state. We could instead adjust the query to return the `CounterValue` directly and then have different `criteria` blocks with different thresholds; but from what I can tell we can only have 1 `criteria` block, and besides, `severity` isn't set within that block but on the parent resource.
Obviously we can put this terraform in a module and create 3 rules from a single definition (in fact, we have); but this feels inefficient / suggests to me that MS is running the same query 3 times just to handle the 3 scenarios despite it being essentially the same query...
Is there a better approach that we're missing?
If you need to use several severity levels then you need to create an alert rule for each one.
In order to reduce code duplication, you can use a module (as you already mentioned) or, as an alternative, a template file for the queries like the following:
Perf
| where TimeGenerated > ago(1h)
| where ObjectName in~ ('LogicalDisk', 'Logical Disk') and CounterName =~ '% Free Space'
| where not (InstanceName in~ ("_Total" , "total" )) and not (InstanceName startswith "/snap")
| summarize arg_max(TimeGenerated, CounterValue) by Computer, InstanceName, _ResourceId
| project Computer, InstanceName, _ResourceId, CounterValue, TimeGenerated
| ${condition}
The last line, `| ${condition}`, is where each rule's own filter is substituted. Each condition value already starts with the `where` keyword, so the template must not repeat it.
The example below shows how to use a map with one entry for each alert rule/severity. I'm using a `null_resource` for the sake of simplicity, but it should be easy enough to adapt this code to your case:
locals {
  alert_rules = {
    "warning" = {
      name      = "VM Free Disk Space - WARNING"
      severity  = 2
      condition = "where CounterValue > 5 and CounterValue <= 10"
    },
    "error" = {
      name      = "VM Free Disk Space - ERROR"
      severity  = 1
      condition = "where CounterValue > 1 and CounterValue <= 5"
    },
    "critical" = {
      name      = "VM Free Disk Space - CRITICAL"
      severity  = 0
      condition = "where CounterValue <= 1"
    }
  }
}

resource "null_resource" "alert_rules" {
  for_each = local.alert_rules
  triggers = {
    name      = each.value.name
    severity  = each.value.severity
    condition = each.value.condition
    query = templatefile("${path.module}/templates/alert-rule.tpl", {
      condition = each.value.condition
    })
  }
}
Running `terraform plan`:
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
+ create
Terraform will perform the following actions:
# null_resource.alert_rules["critical"] will be created
+ resource "null_resource" "alert_rules" {
+ id = (known after apply)
+ triggers = {
+ "condition" = "where CounterValue <= 1"
+ "name" = "VM Free Disk Space - CRITICAL"
+ "query" = <<-EOT
Perf
| where TimeGenerated > ago(1h)
| where ObjectName in~ ('LogicalDisk', 'Logical Disk') and CounterName =~ '% Free Space'
| where not (InstanceName in~ ("_Total" , "total" )) and not (InstanceName startswith "/snap")
| summarize arg_max(TimeGenerated, CounterValue) by Computer, InstanceName, _ResourceId
| project Computer, InstanceName, _ResourceId, CounterValue, TimeGenerated
| where CounterValue <= 1
EOT
+ "severity" = "0"
}
}
# null_resource.alert_rules["error"] will be created
+ resource "null_resource" "alert_rules" {
+ id = (known after apply)
+ triggers = {
+ "condition" = "where CounterValue > 1 and CounterValue <= 5"
+ "name" = "VM Free Disk Space - ERROR"
+ "query" = <<-EOT
Perf
| where TimeGenerated > ago(1h)
| where ObjectName in~ ('LogicalDisk', 'Logical Disk') and CounterName =~ '% Free Space'
| where not (InstanceName in~ ("_Total" , "total" )) and not (InstanceName startswith "/snap")
| summarize arg_max(TimeGenerated, CounterValue) by Computer, InstanceName, _ResourceId
| project Computer, InstanceName, _ResourceId, CounterValue, TimeGenerated
| where CounterValue > 1 and CounterValue <= 5
EOT
+ "severity" = "1"
}
}
# null_resource.alert_rules["warning"] will be created
+ resource "null_resource" "alert_rules" {
+ id = (known after apply)
+ triggers = {
+ "condition" = "where CounterValue > 5 and CounterValue <= 10"
+ "name" = "VM Free Disk Space - WARNING"
+ "query" = <<-EOT
Perf
| where TimeGenerated > ago(1h)
| where ObjectName in~ ('LogicalDisk', 'Logical Disk') and CounterName =~ '% Free Space'
| where not (InstanceName in~ ("_Total" , "total" )) and not (InstanceName startswith "/snap")
| summarize arg_max(TimeGenerated, CounterValue) by Computer, InstanceName, _ResourceId
| project Computer, InstanceName, _ResourceId, CounterValue, TimeGenerated
| where CounterValue > 5 and CounterValue <= 10
EOT
+ "severity" = "2"
}
}
Plan: 3 to add, 0 to change, 0 to destroy.
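To adapt the `null_resource` example to the real alert resource, something along these lines might work. Treat it as a rough sketch rather than a tested configuration: the three variables, the `disk_space` resource label, and the `evaluation_frequency`, `window_duration` and `time_aggregation_method` values are my assumptions and do not come from your original rule. It reuses `local.alert_rules` and the template file from above, and the `condition` values and the template have to agree on which of them supplies the `where` keyword.

```hcl
# Hypothetical inputs; none of these names appear in the original rule.
variable "resource_group_name" { type = string }
variable "location" { type = string }
variable "log_analytics_workspace_id" { type = string }

# One alert rule per entry in local.alert_rules (warning, error, critical).
resource "azurerm_monitor_scheduled_query_rules_alert_v2" "disk_space" {
  for_each = local.alert_rules

  name                = each.value.name
  resource_group_name = var.resource_group_name
  location            = var.location
  scopes              = [var.log_analytics_workspace_id]
  severity            = each.value.severity

  # Assumed schedule: evaluate every 15 minutes over a 1 hour window,
  # matching the ago(1h) filter in the query template.
  evaluation_frequency = "PT15M"
  window_duration      = "PT1H"

  criteria {
    # Render the shared query with this rule's own filter.
    query = templatefile("${path.module}/templates/alert-rule.tpl", {
      condition = each.value.condition
    })

    # Fire when the query returns at least one row.
    time_aggregation_method = "Count"
    operator                = "GreaterThan"
    threshold               = 0
  }
}
```

This still creates three rules, so the query is still evaluated once per severity; the only thing it removes is the duplication in the Terraform code.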