Search code examples
azurealertkqlazure-log-analyticsazure-bicep

Azure Alerts - ScheduledQueryRule - Set Time Range in Query only


I am currently working on an Alert Rule in Log Analytics which is supposed to look back for 90 days, calculate a confidencelimit over that time and then compare today's value with the limit. If today's value exceeds the statistic, then the alert rule should fire.

Something like so (pseudo code):

let jobDurationTbl = latencyTbl
    | where name_s contains "latency_metrics" and TimeGenerated >= (now() - 90d)

let statisticsTbl = jobDurationTbl
    | summarize count(), upperConfidenceLimit = avg(jobDuration_ms) + 2 * stdev(jobDuration_ms) by serviceName_s

let alertTbl = latencyTbl
    | join kind=leftouter (jobDurationTbl) on serviceName_s
    | project
        serviceName_s,
        _ResourceId,
        TimeGenerated
    | join kind=leftouter (statisticsTbl) on serviceName_s
    | where jobDuration_ms > upperConfidenceLimit
    | where TimeGenerated == today;

Please note that I can not post the whole query here due to its length and complexity. The idea is that we have a table jobDurationTbl which contains the length of jobs in ms (jobDuration_ms) for each service. Then we create an alertTbl which should only return rows if the jobDuration_ms exceeds the upperConfidenceLimit calculated from the statisticsTbl

My problem is that the 90 days look back keeps getting overridden by the windowSize in the following line: TimeGenerated >= (now() - 90d). It is not clear what Log Analytics does, but I think it replaces the whole (now() - 90d) with the windowSize. I tried different syntaxes like:

YourTable
| where TimeGenerated >= ago(90d)

But still the time range gets overriden by the windowSize. Does someone know how the Microsoft.Insights scheduledQueryRule should be configured in such a case? I want to specify the windowing functionality solely in the query without the interference of any parameters. The evaluation period should be simply once a day.


Solution

  • Ok, I found a solution in the end. Thank you @Jahnavi you gave the missing clue with the binning logic. The final query with sample data looks as follows:

    let Logs = datatable (TimeGenerated:datetime, ServiceName:string, jobDuration:int, _ResourceId:string ) [
       "1/4/2023, 2:33:36.000 PM", "Service A", 8945849, "/subscriptions/**************/resourcegroups/rg-******-prd-ingestion-landing/providers/microsoft.databricks/workspaces/dbw-******-prd-ingld",
       "12/4/2023, 2:33:36.000 PM", "Service A", 14, "/subscriptions/**************/resourcegroups/rg-******-prd-ingestion-landing/providers/microsoft.databricks/workspaces/dbw-******-prd-ingld",
       "12/5/2023, 3:33:36.000 PM", "Service B", 50, "/subscriptions/**************/resourcegroups/rg-******-prd-ingestion-landing/providers/microsoft.databricks/workspaces/dbw-******-prd-ingld",
       "12/5/2023, 3:33:36.000 PM", "Service B", 800, "/subscriptions/**************/resourcegroups/rg-******-prd-ingestion-landing/providers/microsoft.databricks/workspaces/dbw-******-prd-ingld",
       "12/6/2023, 1:33:36.000 PM", "Service C", 34982334, "/subscriptions/**************/resourcegroups/rg-******-prd-ingestion-landing/providers/microsoft.databricks/workspaces/dbw-******-ingld",
       "1/5/2024, 2:33:36.000 PM", "Service B", 45, "/subscriptions/**************/resourcegroups/rg-******-prd-ingestion-landing/providers/microsoft.databricks/workspaces/dbw-******-prd-ingld",
       "1/5/2024, 3:33:36.000 PM", "Service B", 5000, "/subscriptions/**************/resourcegroups/rg-******-prd-ingestion-landing/providers/microsoft.databricks/workspaces/dbw-******-prd-ingld"
    ] ;
    
    let todaysRunTbl = Logs | where TimeGenerated >= startofday(now());
    //todaysRunTbl
    
    let statistiscsTbl = Logs
    | where TimeGenerated <= startofday(now())
    | where TimeGenerated > ago(90d)
    | summarize upperConfidenceLimit = avg(jobDuration) + 3 * stdev(jobDuration) by bin(TimeGenerated, 90d), ServiceName
    | project ServiceName, upperConfidenceLimit;
    
    
    let alertTbl = todaysRunTbl | join kind=inner statistiscsTbl on ServiceName
    | where jobDuration >= upperConfidenceLimit;
    alertTbl
    

    If today were the 1/5/2024, we would get the right result provided we have parameterized the alert correctly. The right result is the last row of the datatable. For reference, if we want to check every 15 minutes if there was an error in the last 1h, we would configure the alert as follows (bicep):

    param windowSize = 'PT1H'
    param evaluationFrequency = 'PT15M'
    
    param timeAggregation = 'Count'
    
    param operator = 'GreaterThanOrEqual'
    param threshold = 1
    
    param dimensions  = [
      {
        name: 'jobDuration'
        operator: 'Include'
        values: [ '*' ]
      }
      {
        name: 'upperConfidenceLimit'
        operator: 'Include'
        values: [ '*' ]
      }
    ]
    
    param numberOfEvaluationPeriods = 1
    param minFailingPeriodsToAlert = 1
    

    We would get an alert for every row returned by the query.