Tags: azure, azure-logic-apps, azure-diagnostics, azure-logic-app-standard

Logic App Standard Diagnostic logs - Data loss


Background

Our team makes heavy use of Logic Apps (Standard) when building enterprise integrations. These integrations sometimes send business-critical messages, which we cannot afford to lose. If a message is lost, which happens from time to time, we want to be able to track it. We therefore decided to set up Diagnostic Settings and emit Tracked Properties in the workflows, which we can then monitor with Alert Rules that trigger notifications to an Action Group, etc.

Problem

We have now started to realise that a lot of logs are being lost. Out of 22k expected log points, about 4k-6k are lost and never end up in our Log Analytics Workspace. This is bad, since we have no easy way to see what the workflows have processed. Of course we could look at individual workflow runs, but that would be far too time-consuming. In short, we want to provide reliable logs to our operations team.

The documentation states the following:

Azure Monitor Resource Logs aren't 100% lossless. Resource Logs are based on a store and forward architecture designed to affordably move petabytes of data per day at scale. This capability includes built-in redundancy and retries across the platform, but doesn't provide transactional guarantees.

However, losing approximately 20% of the logs seems excessive, especially given the relatively small volume compared to the massive number of logs that can be ingested into Log Analytics.

Troubleshooting Steps

We have ensured that:

  • no sampling is done on the WorkflowRuntimeLogs.
  • the Logic App is not starved of resources by a cramped App Service Plan.
  • the expected workflow runs have completed, and the data reached the end destination. That is, we have successful runs without logs.
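
The last check amounts to a set comparison between the runs that completed and the runs that produced a log entry. A minimal sketch of that comparison, assuming both sets of run IDs have already been exported as plain lists (the function name and the sample IDs are hypothetical and only illustrate the idea):

```python
from __future__ import annotations

from typing import Iterable


def find_unlogged_runs(
    completed_run_ids: Iterable[str],
    logged_run_ids: Iterable[str],
) -> tuple[list[str], float]:
    """Return the run IDs that have no log entry, plus the loss rate in percent."""
    completed = set(completed_run_ids)
    logged = set(logged_run_ids)
    # Runs that finished successfully but never showed up in the logs.
    missing = sorted(completed - logged)
    loss_pct = 100.0 * len(missing) / len(completed) if completed else 0.0
    return missing, loss_pct


# Hypothetical sample data: five successful runs, four of which were logged.
completed = ["run-001", "run-002", "run-003", "run-004", "run-005"]
logged = ["run-001", "run-002", "run-004", "run-005"]

missing, loss_pct = find_unlogged_runs(completed, logged)
print(missing)
print(f"{loss_pct:.1f}% of successful runs have no log entry")
```

Comparing on run IDs rather than on counts shows which runs are missing, not just how many.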

Questions

  • Has anyone else experienced similar behavior with Diagnostic Settings pointing to Log Analytics?
  • Are there any suggestions for alternative logging solutions that could provide more reliable logs?
  • What additional steps can we take to minimize log loss?

Happy to provide more information or clarification if needed. Any insights or suggestions would be greatly appreciated.


Solution

  • Log Analytics is not designed to be a primary functional reporting tool because it inherently works with the concept of sampling. Sampling is a method used to reduce the amount of data that needs to be processed and stored by only considering a subset of the entire dataset.

    While this is useful for performance and cost reasons, it can leave reports working from incomplete data. Azure Functions gives some control over the sampling settings (for example InitialSamplingPercentage), but it cannot guarantee 100% data retention. The same applies to Logic Apps.

    For how sampling works with Azure Functions, see the Microsoft article on sampling in Azure Monitor: https://learn.microsoft.com/en-us/azure/azure-monitor/app/sampling-classic-api#how-sampling-works

    How sampling works

    For applications that don't work with a significant load, sampling isn't needed as these applications can usually send all their telemetry while staying within the quota, without causing data loss from throttling.

    Microsoft also describes what happens when the sampling percentage is set too high:

    What happens if I configure the sampling percentage to be too high?

    Configuring too high a sampling percentage (not aggressive enough) results in an insufficient reduction in the volume of the collected telemetry. You can still experience telemetry data loss related to throttling, and the cost of using Application Insights might be higher than you planned due to overage charges.

    This implies that for low-load applications, sampling might not be necessary, and data loss can be minimized. However, for applications with higher loads or critical logging needs, relying on this mechanism can still result in data loss because of its inherent limitations and the throttling in place. This may explain why you see roughly 20% of the messages missing from Log Analytics.
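
To make the effect of sampling concrete, here is a purely illustrative sketch of fixed-rate sampling. It is not the Application Insights algorithm: the hashing scheme, the `is_kept` function, the `run-N` identifiers and the 80% rate are all assumptions, chosen only to show how a retention rate below 100% translates into missing items.

```python
import hashlib


def is_kept(operation_id: str, sampling_percentage: float) -> bool:
    """Deterministically decide whether an item survives fixed-rate sampling."""
    digest = hashlib.sha256(operation_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket between 0.00 and 99.99.
    bucket = (int.from_bytes(digest[:8], "big") % 10_000) / 100
    return bucket < sampling_percentage


# Hypothetical workload: 22,000 log points sampled at an assumed 80%.
ids = [f"run-{i}" for i in range(22_000)]
kept = sum(is_kept(run_id, 80.0) for run_id in ids)
print(f"kept {kept} of {len(ids)} items ({100 * kept / len(ids):.1f}%)")
```

Because the decision depends only on the identifier, the same item is always kept or always dropped, so a dropped item cannot be recovered by retrying.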