azure-service-fabric

How to restart/recycle underlying VM from Service Fabric runtime


When the app is in a bad state, I'd like to attempt the following recovery steps, in order:

  1. Restarting app itself
  2. Restarting underlying VM
  3. Rebuilding underlying VM

With Cloud Services, calling Environment.FailFast was enough to trigger the above sequence automatically.

How can I achieve the same with Service Fabric? Currently it is used as a deployment/maintenance layer on top of a VM Scale Set (one app instance per VM).

Update: It is not possible to do this with Service Fabric. For our service, we decided to build directly on top of a VM Scale Set. Hopefully we'll see a Cloud Services v2 built on top of VM Scale Sets as well, which would take care of deployment/maintenance.


Solution

  • Service Fabric has a built-in mechanism to restart failed apps, but it does not understand what a 'bad state' is. If the application fails and the process shuts down, SF will restart it a few times until it gives up, considers the app broken, and stops restarting it.

    If this only happens occasionally, for example a few times a week, it won't be a problem, because there is a threshold on how long consecutive failures are considered part of the same problem.
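    The idea of a failure streak that resets after a quiet period can be sketched roughly as follows. This is only an illustration of the concept, not Service Fabric's actual implementation; the class name, field names, and default values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RestartPolicy:
    """Illustrative restart policy; names and defaults are hypothetical."""

    max_consecutive_failures: int = 3
    reset_window_seconds: float = 3600.0
    failures: int = 0
    last_failure_at: Optional[float] = None

    def on_failure(self, now: float) -> str:
        """Record a failure at time `now` (seconds) and decide what to do."""
        # A failure that arrives long after the previous one starts a new streak.
        if (
            self.last_failure_at is not None
            and now - self.last_failure_at > self.reset_window_seconds
        ):
            self.failures = 0
        self.failures += 1
        self.last_failure_at = now
        # Too many failures in one streak: stop restarting.
        if self.failures > self.max_consecutive_failures:
            return "give_up"
        return "restart"
```

    With these assumed defaults, four crashes within a few seconds end in "give_up", while crashes that are days apart each count as the first failure of a new streak.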

    Each application might have a different concept of a bad state, so SF cannot identify it unless the application reports it via a health event.

    Examples:

    • The application might be consuming too much memory (a memory leak). The only thing you can do is restrict the app's memory by setting a limit. SF does not know that it is a leak, because the memory consumption may be part of your application's design.
    • An API might return error responses because of invalid configuration or because a dependent service is down. Service Fabric does not know whether the errors come from an application failure or are part of your services' design.

    In these cases, you have to implement a mechanism to tell SF that these errors were not expected, and SF will handle the failover for you. You could implement it as:

    • Part of your application, which emits its own health reports
    • A watchdog app running in the cluster that monitors service events or logs and emits health events on behalf of other services.
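    As a rough sketch of the watchdog option, the function below only builds the request a watchdog might send to the cluster's REST endpoint to report a service's health. The endpoint path and field names are written from memory of the Service Fabric REST health-reporting API and should be checked against the current reference. The function name, service name, and property name are hypothetical, and nothing is sent over the network.

```python
import json
from urllib.parse import quote


def build_health_report(service_name, source_id, prop, healthy, description=""):
    """Return (path, json_body) for a service health report.

    Assumption: the cluster accepts POST /Services/{id}/$/ReportHealth with a
    body containing SourceId, Property, HealthState, and Description.
    """
    # Assumed id format: "fabric:/MyApp/MyService" becomes "MyApp~MyService".
    prefix = "fabric:/"
    if service_name.startswith(prefix):
        service_name = service_name[len(prefix):]
    service_id = service_name.replace("/", "~")

    path = "/Services/{}/$/ReportHealth?api-version=6.0".format(
        quote(service_id, safe="~")
    )
    body = {
        "SourceId": source_id,        # who is reporting, e.g. the watchdog
        "Property": prop,             # which aspect of the service is reported
        "HealthState": "Ok" if healthy else "Error",
        "Description": description,
    }
    return path, json.dumps(body)
```

    A watchdog would call this whenever its own check (error rate, log pattern, and so on) changes state, and POST the body to the returned path on the cluster's HTTP gateway.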

    For the first approach, the quickest way to report that a failure has occurred is to use ReportFault:

    Enables the replica to report a fault to the runtime and indicates that it has encountered an error from which it cannot recover and must either be restarted or removed.

    For more information about other health reports, take a look at the docs: Service Fabric health reports

    For items 2 and 3 of your question, there is a mechanism that identifies when a node is unavailable within the cluster: the node gets demoted and SF removes it from the ring temporarily. A common cause is a network issue that prevents the nodes from communicating with each other. In some cases, a lack of memory affects the SF Host Manager, which starts failing or responding slowly. SF then removes the node from the list of available nodes until it is healthy again.

    I am not aware of anything in SF that restarts a VM, probably for the same reasons given above. SF will raise failures in the SF Explorer to notify you about the problem, and you will have to handle it yourself.

    You could build a solution as part of the watchdog approach described above: it would call Disable-ServiceFabricNode to move any healthy services off the node, and then restart the underlying VM using the Azure SDK.
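    The escalation order from the question (restart the app, restart the VM, rebuild the VM) could be driven by a small loop like the one below. It is a minimal sketch: the function and step names are hypothetical, and the actual actions (restarting the service, disabling the node and restarting the VM through the Azure SDK, rebuilding the VM) are passed in as plain callables and are not implemented here.

```python
from typing import Callable, Optional, Sequence, Tuple

# A recovery step is a label plus a zero-argument action to run.
Step = Tuple[str, Callable[[], None]]


def recover(steps: Sequence[Step], is_healthy: Callable[[], bool]) -> Optional[str]:
    """Run recovery steps in order until the health check passes.

    Returns the name of the step after which the app was healthy,
    or None if every step ran and the app is still unhealthy.
    """
    for name, action in steps:
        action()
        if is_healthy():
            return name
    return None
```

    A watchdog would pass something like [("restart_app", ...), ("restart_vm", ...), ("rebuild_vm", ...)], where the "restart_vm" action first disables the node and then restarts the VM. A real implementation would also need to wait between running an action and checking health.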