Azure Application Insights sampling changed sampling rate


One of our Azure Functions v3 apps went from 200 MB of Application Insights ingestion to ~18 GB. We did not add any logging statements, change any SDKs, or trigger any additional function executions. We do not specify an Application Insights SDK in our project, so it's using whatever Azure has installed. Running the query below, which Microsoft recommends for showing the sampling percentage, makes it obvious that something changed with adaptive sampling.

union requests,dependencies,pageViews,browserTimings,exceptions,traces
| where timestamp > ago(50d)
| summarize RetainedPercentage = 100/avg(itemCount) by bin(timestamp, 1h), itemType
| order by timestamp, itemType

This is before the spike occurred: [image]

This is after the spike occurred: [image]

Here is the host.json

{
  "version": "2.0",
  "logging": {
    "logLevel": {
      "default": "Information",
      "Host.Triggers.DurableTask": "Warning",
      "DurableTask.AzureStorage": "Warning",
      "DurableTask.Core": "Warning"
    },
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "excludedTypes": "Request"
      }
    }
  },
  "extensions": {
    "eventHubs": {
      "batchCheckpointFrequency": 1,
      "eventProcessorOptions": {
        "maxBatchSize": 64,
        "prefetchCount": 128
      }
    },
    "durableTask": {
      "hubName": "FooDevicesTaskHub",
      "storageProvider": {
        "connectionStringName": "AzureWebJobsStorageDurable"
      },
      "tracing": {
        "traceInputsAndOutputs": false,
        "traceReplayEvents": false
      }
    },
    "serviceBus": {
      "messageHandlerOptions": {
        "maxConcurrentCalls": 1
      }
    }
  }
}

Here are the packages

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>netcoreapp3.1</TargetFramework>
    <AzureFunctionsVersion>v3</AzureFunctionsVersion>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="AutoMapper" Version="8.1.1" />
    <PackageReference Include="Azure.Storage.Blobs" Version="12.8.0" />
    <PackageReference Include="Azure.Storage.Files.DataLake" Version="12.2.2" />
    <PackageReference Include="Microsoft.Azure.Devices" Version="1.18.1" />
    <PackageReference Include="Microsoft.Azure.EventGrid" Version="3.2.0" />
    <PackageReference Include="Microsoft.Azure.WebJobs.Extensions.CosmosDB" Version="3.0.7" />
    <PackageReference Include="Microsoft.Azure.WebJobs.Extensions.DurableTask" Version="2.5.1" />
    <PackageReference Include="Microsoft.Azure.WebJobs.Extensions.EventGrid" Version="2.1.0" />
    <PackageReference Include="Microsoft.Azure.WebJobs.Extensions.EventHubs" Version="4.1.1" />
    <PackageReference Include="Microsoft.Azure.WebJobs.Extensions.ServiceBus" Version="4.3.0" />
    <PackageReference Include="Microsoft.Azure.WebJobs.Extensions.Storage" Version="4.0.3" />
    <PackageReference Include="Microsoft.Extensions.Http" Version="3.1.7" />
    <PackageReference Include="Microsoft.NET.Sdk.Functions" Version="3.0.13" />
    <PackageReference Include="Microsoft.Azure.Functions.Extensions" Version="1.0.0" />
    <PackageReference Include="Polly" Version="7.2.1" />
    <PackageReference Include="Polly.Contrib.WaitAndRetry" Version="1.1.1" />
    <PackageReference Include="SendGrid" Version="9.24.2" />
    <PackageReference Include="System.Net.Http.Json" Version="5.0.0" />
  </ItemGroup>
</Project>

Added more query results based on comment:

traces
| summarize sum(itemCount), count(), dcount(strcat(cloud_RoleName, "/")) by bin(timestamp, 30sec)
| render timechart

Before: [image]

After: [image]

Any ideas on what might cause this, or what to look for? We have a ticket open with Microsoft, but they have been looking into it for weeks.


Solution

  • Adaptive sampling is applied on a per-instance basis. If the load per node decreased (either the overall load dropped, or you restructured the app, for example by switching to a different plan, and now run on many smaller instances), each instance sees less traffic and samples less aggressively, so a larger share of telemetry is retained. That could explain the numbers.

    To check whether this is the case, you can output the following columns:

    sum(itemCount), count(), dcount(strcat(cloud_RoleName, "/", cloud_RoleInstance), 4)
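
    The columns above can be put into a full query along these lines. This is only a sketch: the choice of the traces table, the 50-day window, the 1-hour bin, and the column names (OriginalItems, RetainedItems, Instances, RetainedPerInstance) are assumptions made for illustration, not something Microsoft prescribes.

    ```kusto
    traces
    | where timestamp > ago(50d)
    | summarize
        OriginalItems = sum(itemCount),   // estimated item count before sampling
        RetainedItems = count(),          // items actually stored after sampling
        // approximate number of distinct instances that emitted telemetry
        Instances = dcount(strcat(cloud_RoleName, "/", cloud_RoleInstance), 4)
        by bin(timestamp, 1h)
    | extend RetainedPerInstance = todouble(RetainedItems) / Instances
    | order by timestamp asc
    ```

    If Instances rises around the time of the spike while OriginalItems stays roughly flat, that would be consistent with the same load being spread over more instances, each of which samples less. If Instances is unchanged, the cause is more likely elsewhere.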