error-handling, publish, azure-service-fabric

ServiceFabric: Service does not exist during deployment


I have an existing system built on Service Fabric. Everything works, except that while a service is being published it is unavailable and every attempt to resolve it returns an error.

This is expected; however, it would be nice if calls made during this time simply waited or timed out instead. During a publish my error logs sometimes fill up with 200K lines of the same error.

I would like something like the following, but where would it go?

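public async Task Execute(Func<Task> action)
{
    try
    {
        // Await the call so that its exception surfaces inside the try block.
        await action()
            .ConfigureAwait(false);
    }
    catch (FabricServiceNotFoundException)
    {
        // Wait before the single retry; the delay is still to be decided.
        await Task.Delay(TimeSpan.FromSeconds(??))
            .ConfigureAwait(false);

        await action()
            .ConfigureAwait(false);
    }
}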

Error:

System.Fabric.FabricServiceNotFoundException: Service does not exist. ---> System.Runtime.InteropServices.COMException: Exception from HRESULT: 0x80071BCD
   at System.Fabric.Interop.NativeClient.IFabricServiceManagementClient6.EndResolveServicePartition(IFabricAsyncOperationContext context)
   at System.Fabric.FabricClient.ServiceManagementClient.ResolveServicePartitionEndWrapper(IFabricAsyncOperationContext context)
   at System.Fabric.Interop.AsyncCallOutAdapter2`1.Finish(IFabricAsyncOperationContext context, Boolean expectedCompletedSynchronously)
   --- End of inner exception stack trace ---
   at Microsoft.ServiceFabric.Services.Client.ServicePartitionResolver.ResolveHelperAsync(Func`5 resolveFunc, ResolvedServicePartition previousRsp, TimeSpan resolveTimeout, TimeSpan maxRetryInterval, CancellationToken cancellationToken, Uri serviceUri)
   at Microsoft.ServiceFabric.Services.Communication.Client.CommunicationClientFactoryBase`1.CreateClientWithRetriesAsync(ResolvedServicePartition previousRsp, TargetReplicaSelector targetReplicaSelector, String listenerName, OperationRetrySettings retrySettings, Boolean doInitialResolve, CancellationToken cancellationToken)
   at Microsoft.ServiceFabric.Services.Communication.Client.CommunicationClientFactoryBase`1.GetClientAsync(ResolvedServicePartition previousRsp, TargetReplicaSelector targetReplica, String listenerName, OperationRetrySettings retrySettings, CancellationToken cancellationToken)
   at Microsoft.ServiceFabric.Services.Remoting.V2.FabricTransport.Client.FabricTransportServiceRemotingClientFactory.GetClientAsync(ResolvedServicePartition previousRsp, TargetReplicaSelector targetReplicaSelector, String listenerName, OperationRetrySettings retrySettings, CancellationToken cancellationToken)
   at Microsoft.ServiceFabric.Services.Communication.Client.ServicePartitionClient`1.GetCommunicationClientAsync(CancellationToken cancellationToken)
   at Microsoft.ServiceFabric.Services.Communication.Client.ServicePartitionClient`1.InvokeWithRetryAsync[TResult](Func`2 func, CancellationToken cancellationToken, Type[] doNotRetryExceptionTypes)
   at Microsoft.ServiceFabric.Services.Remoting.V2.Client.ServiceRemotingPartitionClient.InvokeAsync(IServiceRemotingRequestMessage remotingRequestMessage, String methodName, CancellationToken cancellationToken)
   at Microsoft.ServiceFabric.Services.Remoting.Builder.ProxyBase.InvokeAsyncV2(Int32 interfaceId, Int32 methodId, String methodName, IServiceRemotingRequestMessageBody requestMsgBodyValue, CancellationToken cancellationToken)
   at Microsoft.ServiceFabric.Services.Remoting.Builder.ProxyBase.ContinueWithResultV2[TRetval](Int32 interfaceId, Int32 methodId, Task`1 task)

Solution

  • As expected, Service Fabric has to shut down the service in order to start the new version, and this causes a transient error like the one you are seeing.

    By default, the Remoting APIs already have retry logic built in. From the docs:

    The service proxy handles all failover exceptions for the service partition it is created for. It re-resolves the endpoints if there are failover exceptions (non-transient exceptions) and retries the call with the correct endpoint. The number of retries for failover exceptions is indefinite. If transient exceptions occur, the proxy retries the call.

    With that said, you should not need to add extra retry logic; instead, try adjusting the OperationRetrySettings so that these retries are handled better.
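
    As a rough sketch of what adjusting those settings could look like, the retry settings can be passed to a ServiceProxyFactory when the proxy is created. The interface IMyService, the URI fabric:/MyApp/MyService and the numeric values are hypothetical placeholders, and whether these settings change how FabricServiceNotFoundException is handled during a publish is an assumption to verify against your cluster:

    ```csharp
    using System;
    using System.Threading.Tasks;
    using Microsoft.ServiceFabric.Services.Communication.Client;
    using Microsoft.ServiceFabric.Services.Remoting;
    using Microsoft.ServiceFabric.Services.Remoting.Client;

    // Hypothetical remoting contract, for illustration only.
    public interface IMyService : IService
    {
        Task DoSomethingAsync();
    }

    public static class ProxyFactoryExample
    {
        public static IMyService CreateProxy()
        {
            // Placeholder values, not recommendations:
            // max back-off for transient errors, max back-off for
            // non-transient errors, and the default max retry count.
            var retrySettings = new OperationRetrySettings(
                TimeSpan.FromSeconds(5),
                TimeSpan.FromSeconds(5),
                20);

            var factory = new ServiceProxyFactory(retrySettings);

            // Hypothetical service URI.
            return factory.CreateServiceProxy<IMyService>(
                new Uri("fabric:/MyApp/MyService"));
        }
    }
    ```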

    If that does not solve the problem and you still want to add the logic in your own code, the simplest way is to use a transient-fault-handling library like Polly, something like the following:

       var policy = Policy
                     .Handle<FabricServiceNotFoundException>()
                     .WaitAndRetry(new[]
                     {
                       TimeSpan.FromSeconds(1),
                       TimeSpan.FromSeconds(2),
                       TimeSpan.FromSeconds(3)
                     });
    
       policy.Execute(() => DoSomething());
    

    In this sample the wait increases between retries (1, 2, then 3 seconds, so it grows linearly rather than truly exponentially). If the number of calls is very large, I would recommend implementing the circuit breaker approach instead.
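
If you would rather not take a dependency on Polly, the retry idea can be sketched by hand. The following is a minimal, dependency-free sketch of an async retry with a delay that doubles on each attempt; the class name Retry and its parameters are made up for illustration and are not part of Service Fabric or Polly:

```csharp
using System;
using System.Threading.Tasks;

public static class Retry
{
    // Runs the action and retries up to maxRetries times when TException is
    // thrown, waiting baseDelay, 2 * baseDelay, 4 * baseDelay, ... in between.
    // Once the retries are used up, the last exception propagates to the caller.
    public static async Task ExecuteAsync<TException>(
        Func<Task> action, int maxRetries, TimeSpan baseDelay)
        where TException : Exception
    {
        for (var attempt = 0; ; attempt++)
        {
            try
            {
                // Await here so that the exception is observed inside the try block.
                await action().ConfigureAwait(false);
                return;
            }
            catch (TException) when (attempt < maxRetries)
            {
                // Double the delay on every failed attempt
                // (keep maxRetries small to avoid overflowing the shift).
                var delay = TimeSpan.FromTicks(baseDelay.Ticks * (1L << attempt));
                await Task.Delay(delay).ConfigureAwait(false);
            }
        }
    }
}
```

A call site might then look like Retry.ExecuteAsync<FabricServiceNotFoundException>(() => DoSomething(), 5, TimeSpan.FromSeconds(1)), assuming DoSomething returns a Task. Unlike a circuit breaker, this still lets every caller retry independently, so it reduces but does not eliminate repeated failures during a publish.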