Tags: azure, azure-functions, azure-servicebus-queues

Azure service bus trigger function app stops monitoring its queue


The problem we are experiencing is that a Service Bus queue trigger function app, on rare occasions (roughly 99.999% uptime), stops doing its job: it no longer picks up messages from the queue it monitors. The function app still shows as running. We have found no errors (in the function app logs, Application Insights, Service Bus logs, etc.) that explain why it does not recognize new messages in the queue. Restarting the function app causes the waiting messages to be processed.

We have seen this behavior in both our production and UAT/testing environments for about two of our Service Bus trigger function apps; however, we have other function apps of the same trigger type that have yet to exhibit it. The only difference between the two environments is that production uses a Premium Service Bus.

So, the million dollar question is why the function apps stop seeing new messages in the queues they are monitoring, until being restarted, given that they are not on a consumption plan and they are configured to always be on?

Production:

Function App:
    Runtime Version - ~4
    .NET Version - .NET 6 (LTS) Isolated (I know, we are going to 8 soon. :) )
    Type - Service Bus Trigger
    Always On Setting - True
    Number of functions - 1
    Storage account - specific only to the function app

App Service Plan:
    Type -  P2v3
    # of Apps - 27

Service Bus:
    Type - Premium

Queue monitored:
    Session enabled: false

We have had a ticket open with Microsoft for the last two days, but they do not have a solution at this point. Interactions with their support team members have, thus far, confirmed that our setup is coded, implemented, and configured correctly.


Solution

  • Issue:

    • Microsoft performed a maintenance update.
    • We were not running at least two instances of the app service plan (recommended, but at an added cost, of course).
    • The update completed but it left the function app in an "unhealthy" state. The app was running but could no longer recognize and respond to new messages in the service bus queue.
    • Restarting the function app corrected the immediate problem.

    Why a problem occurred:

    • Running only one instance of the app service plan.
    • When Microsoft needs to make updates (whatever they may be) the process has no choice but to attempt the update on this singular plan.
    • So, there is the possibility that the update causes a problem, and that is what happened, in this case.

    Mitigation:

    Running with at least two instances of the app service plan:

    • The update process will only update one instance at a time, per a
      schedule Microsoft has in place, where it will ensure not more than
      one instance is being updated at the same time.
    • The load balancer will direct all traffic to the instance(s) that is/are not being updated.

    Using Azure Health Check:

    • Not a bad solution, but not enough for a function app that watches a queue. Message processing may be time sensitive, and waiting for Health Check to mark the current instance unhealthy and move to another one may take too long.

    Code Solution:

    • Create a timer trigger function app that checks the queue to see how long the most recent message has been there. Just a "peek", of course, and if the time is longer than what is considered normal then send an alert.
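
The check described above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the code we run: `check_queue`, `QueueCheck`, and `max_age` are hypothetical names, and the enqueue time is assumed to come from a non-destructive peek of the queue (for example, the enqueued timestamp on a peeked message in the Service Bus SDK). The timer trigger and the alerting call are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class QueueCheck:
    stalled: bool
    age: Optional[timedelta]  # None when the peek returned no message


def check_queue(peeked_enqueued_utc: Optional[datetime],
                max_age: timedelta,
                now: Optional[datetime] = None) -> QueueCheck:
    """Decide whether the peeked message has waited longer than max_age.

    peeked_enqueued_utc is the enqueue time (timezone-aware, UTC) of the
    message returned by a peek, or None if the queue was empty.
    """
    if peeked_enqueued_utc is None:
        # Empty queue: nothing is waiting, so nothing can be stuck.
        return QueueCheck(stalled=False, age=None)
    now = now or datetime.now(timezone.utc)
    age = now - peeked_enqueued_utc
    return QueueCheck(stalled=age > max_age, age=age)
```

A timer-triggered function would call this every few minutes with the peeked message's enqueue time and send an alert when `stalled` is true. The threshold should be comfortably above normal processing latency so that a brief backlog does not raise a false alarm.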

    Azure Solution:

    • Use Azure Monitor and configure an alert that determines if the most current message has been in the queue longer than a predefined time.
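
One caveat worth checking before relying on this: the Service Bus metrics in Azure Monitor are, as far as I can tell, counts (active messages, incoming, outgoing, completed) rather than message ages, so the alert may have to approximate "a message has waited too long". A rough sketch of one such rule follows, assuming each sample is one time grain of an active-message count and a completed-message count. `consumer_looks_stalled` and the sample layout are hypothetical, and the real alert would be configured in Azure Monitor rather than written in code.

```python
from typing import Sequence, Tuple

# One sample per time grain: (active_messages, completed_messages).
Sample = Tuple[int, int]


def consumer_looks_stalled(samples: Sequence[Sample], min_samples: int = 3) -> bool:
    """True when each of the last min_samples grains shows messages waiting
    in the queue and none completed, i.e. nothing is draining the queue."""
    if len(samples) < min_samples:
        return False  # not enough data to judge
    recent = samples[-min_samples:]
    return all(active > 0 and completed == 0 for active, completed in recent)
```

With five-minute grains and `min_samples=3`, this would fire after roughly fifteen minutes of a non-empty queue with no completions, which matches the failure we saw: the app was running but nothing was being consumed.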

    Conclusion:

    So, in essence, our problem was that Microsoft maintenance applied an update, and our environment could not cope when that update left the app in a bad state. There is no way to be 100% safe when it comes to updates, but by keeping at least two instances active we should be able to avoid future problems related to maintenance updates. I am also exploring an Azure Monitor alert to tell us when messages stay in the queue longer than we would expect. I'll explore the code solution if Azure Monitor does not work for our case.

    Shout-out/Response to others:

    Finally, thank you Vivek, for your suggestions. In this case, a timer trigger, just to see if the queue monitoring function app was idle/not running, would not work because the function app is set to always be running, and it was active, but it just lost the ability to see new messages in the queue for processing.