I've got a stateful service running in a Service Fabric cluster that I now know fails to honor a cancellation token passed into it. My fault.
I'm ready to release the fix, but during the upgrade process, I'm expecting the service replica on the faulty primary node to get stuck since it won't honor the token passed in.
I can use Restart-ServiceFabricDeployedCodePackage
or even Restart-ServiceFabricNode
to manually take down the stuck replica, but that will result in a brief service interruption during the upgrade process.
Is there any way to release this fix with zero downtime?
This is not possible for a stateful service using the Service Fabric infrastructure, you will need to have downtime on the upgrade. Once you have a version that supports the cancellation token then you will be fine.
That said, depending on the use of the state, and if you have a load balancer between your clients and the service, you can stand up another service instance on the new fixed version and use the load balancer to drain your traffic across to then new version, upgrade the old, drain back to it and then drop the second service you created. This will allow for a zero downtime scenario.