c#, azure-service-fabric, cancellation-token

Simulate a cancellation token request in RunAsync() in Service Fabric


I am trying to use the FabricClient API in order to simulate a graceful failure (like partition/replica/instance restart), but for some reason the service keeps recovering.

The only time it finally works is when I manually delete the service from the Cluster UI; then I can see that the service is stuck because RunAsync is stuck. (I have written a special dummy service which doesn't honor the cancellation token.)

These are my attempts:

foreach (var service in Services)
{
    var partitions = FabricClient.QueryManager.GetPartitionListAsync(service.ServiceName).Result;
    foreach (var partition in partitions)
    {
        var operationGuid = Guid.NewGuid();
        restartOperationsIds.Add(operationGuid);
        var partitionId = partition.PartitionInformation.Id;

        FabricClient.FaultManager.RestartReplicaAsync(
            ReplicaSelector.PrimaryOf(PartitionSelector.PartitionIdOf(service.ServiceName, partitionId)),
            CompletionMode.Verify, CancellationToken.None);

        FabricClient.TestManager.StartPartitionRestartAsync(operationGuid,
            PartitionSelector.PartitionIdOf(service.ServiceName, partitionId),
            RestartPartitionMode.AllReplicasOrInstances, TimeSpan.FromMinutes(2));
    }
}

RestartReplicaAsync doesn't seem to do anything, while StartPartitionRestartAsync makes the service appear to restart, but it then comes back up successfully.


Solution

  • The cancellation token is cancelled in a few scenarios, most of them maintenance-related. They include:

    • Upgrades: The service is shut down to be updated, and RunAsync() will be called again when it restarts.
    • Scaling down: Replicas are removed on a scale-down, and RunAsync() is not called again.
    • Load balancing: When SF needs to move services around, RunAsync() will be called again after the move.
    • Node deactivation (Restart / RemoveData): SF moves services to other nodes, triggering the cancellation for a graceful shutdown.
    • Remove application / service: When you remove a service or application from the cluster.
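In all of the scenarios above, the service only shuts down gracefully if RunAsync actually observes the token. The following is a minimal sketch of that pattern using only System.Threading, so it can be tried outside a cluster. The class name RunAsyncSketch, the 100 ms delay, and the Main driver are illustrative assumptions; in a real Reliable Service the method would be the RunAsync override and Service Fabric would own the token.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class RunAsyncSketch
{
    // Stand-in for a Reliable Service RunAsync override (hypothetical class,
    // not part of the Service Fabric SDK).
    public static async Task RunAsync(CancellationToken cancellationToken)
    {
        try
        {
            while (true)
            {
                // Stop promptly if cancellation was requested between iterations.
                cancellationToken.ThrowIfCancellationRequested();

                // Pass the token to every awaited call so a cancellation
                // interrupts the wait instead of being noticed later.
                await Task.Delay(TimeSpan.FromMilliseconds(100), cancellationToken);
            }
        }
        finally
        {
            // Runs on the graceful path only; a killed process never gets here.
            Console.WriteLine("Cleaning up before exit.");
        }
    }

    // Local driver that plays the role of the runtime cancelling the token.
    public static async Task Main()
    {
        using var cts = new CancellationTokenSource();
        Task run = RunAsync(cts.Token);

        cts.CancelAfter(TimeSpan.FromMilliseconds(250));

        try
        {
            await run;
        }
        catch (OperationCanceledException)
        {
            Console.WriteLine("RunAsync observed the cancellation and exited.");
        }
    }
}
```

A dummy service like the one in the question would be the same loop with the token left out of both calls, which is why it stays stuck when the service is deleted.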

    There are other events where services are shut down forcefully and the token is not cancelled, for example when you call Restart-ServiceFabricDeployedCodePackage, Restart-ServiceFabricPartition, or Restart-ServiceFabricNode.
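To exercise the graceful path from FabricClient instead of the restart APIs, one option is to go through one of the maintenance scenarios listed above, such as node deactivation. The following is a rough sketch under stated assumptions: it needs the System.Fabric assembly and a test cluster, the class name, the nodeName parameter and the holdTime parameter are hypothetical, and it assumes the target service has a replica or instance on that node. It has not been verified against the asker's service.

```csharp
using System;
using System.Fabric;
using System.Threading.Tasks;

public static class GracefulShutdownSketch
{
    // Deactivates a node with the Restart intent, waits, then re-enables it.
    // nodeName and holdTime are illustrative parameters, not from the question.
    public static async Task DeactivateThenReactivateAsync(
        FabricClient fabricClient, string nodeName, TimeSpan holdTime)
    {
        // Asks Service Fabric to take the node out of service; per the list
        // above, this is one of the cases where the token gets cancelled.
        await fabricClient.ClusterManager.DeactivateNodeAsync(
            nodeName, NodeDeactivationIntent.Restart);

        // Assumption: the call returns once the request is accepted, so give
        // the cluster time to actually close or move the replicas.
        await Task.Delay(holdTime);

        // Bring the node back so the cluster returns to its normal capacity.
        await fabricClient.ClusterManager.ActivateNodeAsync(nodeName);
    }
}
```

A call such as `await GracefulShutdownSketch.DeactivateThenReactivateAsync(client, "_Node_0", TimeSpan.FromMinutes(1));` would be one way to try it, where "_Node_0" is only a placeholder node name. On a small cluster, safety checks may delay or block the deactivation.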