Tags: .net, azure, asynchronous, timeout, application-shutdown

Graceful shutdown of Azure worker role


Let's consider a worker role that:

  1. Hosts a WCF server
  2. Listens to a few Azure Storage Queues and Service Bus queues

The processing methods perform Azure Storage I/O, HttpClient calls to external APIs, and Entity Framework calls. I want the worker role to shut down gracefully, so that all pending operations are either finished or cancelled in a controlled manner:

  1. Stop accepting any incoming requests once RoleEntryPoint.OnStop() is triggered. Does Azure do this for me? If not, how do I enforce it?
  2. Allow N seconds for any pending operation to complete
  3. After N seconds, cancel any operations still running. The cancellation must take no more than M seconds, so that N + M < 5 minutes. I believe 5 minutes is the time the Azure runtime is guaranteed to wait between triggering OnStop() and terminating the process.

I'm imagining something like this:

public override void Run() {
   // create a cancellation token source
   try {
     // pass the token to all processing/listening routines
   }
   catch (Exception e) { }
}

public override void OnStop() { 
   try {
      // trigger the cancellation token source
   } 
   catch (Exception e) { }
}

The naive sample above assumes that all my processing routines are async from top to bottom (down to the EF/HttpClient calls). If that is the way to go, I need a working example that covers the preconditions above (WCF host, queue listeners).

Open questions:

  1. How do I make sure no more incoming TCP requests are sent to my worker role after OnStop() is triggered? This matters for fitting the shutdown code into the 5-minute limit.
  2. How do I determine concrete values for N and M, given the WCF channel timeouts, EF timeouts, etc. in the configuration file?
  3. Is this even possible for synchronous code?

Solution

  • Stop accepting any incoming requests once RoleEntryPoint.OnStop() is triggered. Does Azure do this for me? If not, how do I enforce it?

    As this official document notes about ServiceHost.Close():

    The Close method allows any unfinished work to be completed before returning. For example, finish sending any buffered messages.

    To stop a WCF service from accepting new requests while allowing existing connections to continue, you could refer to this issue.

    For the Service Bus queue listeners, you could define a CancellationTokenSource object and invoke CancellationTokenSource.Cancel() once RoleEntryPoint.OnStop() is triggered.

    Then check whether cancellation has been requested on the CancellationTokenSource before retrieving each message, as follows:

    try
    {
        if (!_cancellationTokenSource.IsCancellationRequested)
        {
            //retrieve and process the message
        }
    }
    catch (Exception)
    {
        // Handle any message processing specific exceptions here
    }
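
    The check above normally sits inside the receive loop. A minimal sketch of such a loop follows, assuming two hypothetical delegates, receiveAsync and handleAsync, that stand in for the real queue client call and the message handler. They are illustrative only and are not Azure SDK APIs. receiveAsync is assumed to return null when the queue is empty.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class QueueListener
{
    // receiveAsync and handleAsync are hypothetical stand-ins for the real
    // queue client and message handler; they are not Azure SDK calls.
    // Returns the number of messages handled before cancellation.
    public static async Task<int> RunAsync(
        Func<CancellationToken, Task<string>> receiveAsync,
        Func<string, Task> handleAsync,
        CancellationToken token)
    {
        int processed = 0;
        while (!token.IsCancellationRequested)
        {
            string message;
            try
            {
                message = await receiveAsync(token);
            }
            catch (OperationCanceledException)
            {
                break; // shutdown was requested while waiting for a message
            }

            if (message == null)
            {
                continue; // queue was empty; re-check the token and poll again
            }

            try
            {
                // The message in flight is not interrupted here; the token
                // is only consulted between messages.
                await handleAsync(message);
                processed++;
            }
            catch (Exception)
            {
                // Handle any message processing specific exceptions here
            }
        }
        return processed;
    }
}
```

    The task returned by RunAsync completes once the token is cancelled and the current message, if any, has been handled, so OnStop can wait on that task instead of guessing how long the listener needs.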
    

  • Allow N seconds for any pending operation to complete

    As I understand it, you could call Task.Delay(TimeSpan.FromSeconds(N)).Wait() in the OnStop function after you invoke CancellationTokenSource.Cancel() and close the WCF service. Any operations still pending after that delay are discarded when the worker role instance shuts down.
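
    A fixed delay always waits the full N seconds, even when nothing is pending. One alternative, sketched below under assumptions, is to wait on the pending tasks with a timeout and return as soon as they finish. The class name GracefulShutdown, the method Drain, and the two token sources (stopIntake for the listeners, abortWork for in-flight operations) are hypothetical names chosen for illustration. The sketch also assumes that the processing code keeps track of its pending tasks and passes the abortWork token down to its I/O calls.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class GracefulShutdown
{
    // Hypothetical helper: returns true if all pending tasks completed
    // (successfully, faulted, or cancelled) within the n + m budget.
    public static bool Drain(
        Task[] pending,
        CancellationTokenSource stopIntake, // tells listeners to stop taking new work
        CancellationTokenSource abortWork,  // token handed to in-flight operations
        TimeSpan n,                         // time allowed for work to finish on its own
        TimeSpan m)                         // time allowed for cancellation to take effect
    {
        stopIntake.Cancel();
        Task all = Task.WhenAll(pending);

        // Phase 1: give pending operations up to n to complete normally.
        if (WaitWithoutThrowing(all, n))
        {
            return true;
        }

        // Phase 2: cancel whatever is left and give it up to m to unwind.
        abortWork.Cancel();
        return WaitWithoutThrowing(all, m);
    }

    // Blocks until the task completes or the timeout elapses. Task.WhenAny
    // never faults, so exceptions from the pending work are not rethrown here.
    private static bool WaitWithoutThrowing(Task task, TimeSpan timeout)
    {
        Task.WhenAny(task, Task.Delay(timeout)).Wait();
        return task.IsCompleted;
    }
}
```

    In OnStop this could be called as GracefulShutdown.Drain(pendingTasks, stopIntake, abortWork, TimeSpan.FromSeconds(N), TimeSpan.FromSeconds(M)), with N + M chosen to stay below the shutdown limit. Synchronous code that never observes the abortWork token cannot be interrupted this way, so for such code the second phase only adds waiting time.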

  • How to find out concrete numbers for N and M considering all the stuff like WCF channel timeouts, EF timeouts, etc. in the configuration file?

    You could use Application Insights with your worker role to collect metrics and choose a reasonable value for N, one that keeps the failed request rate low while still letting the VM restart quickly and begin processing new requests. You could also refer to this tutorial about handling the Azure OnStop event.