azure azure-service-fabric azureservicebus azure-servicebus-queues

Long Running Tasks in Service Fabric and Scaling Cluster In

We are using Azure Service Fabric (Stateless Service) which gets messages from the Azure Service Bus Message Queue and processes them. The tasks generally take between 5 mins and 5 hours.

When it's busy we want to scale out servers, and when it gets quiet we want to scale back in again.

How do we scale in without interrupting long running tasks? Is there a way we can tell Service Fabric which server is free to scale in?

Solution

Azure Monitor Custom Metric
- Integrate your SF service with EventFlow, for instance so that it sends logs to Application Insights.
- While a task is being processed, emit log entries indicating that it is in progress.
- Configure a custom metric in Azure Monitor so that scale-in happens only when there are no logs indicating that a machine has in-progress tasks.

The trade-off is that scale-in cannot happen until all in-progress tasks have finished.
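
The scale-in condition above amounts to "this machine has logged no in-progress signal for some time". Here is a minimal Python sketch of that check, assuming the timestamps of the latest in-progress log per node have already been read from wherever the logs end up. The function name, the data shape and the 10-minute window are illustrative assumptions, not part of Azure Monitor or EventFlow.

```python
from datetime import datetime, timedelta, timezone


def nodes_safe_to_scale_in(last_heartbeat, now, quiet_window=timedelta(minutes=10)):
    """Return nodes whose latest 'task in progress' log is older than quiet_window.

    last_heartbeat maps a node name to the timestamp of its most recent
    in-progress log entry, or None if the node never logged one.
    """
    idle = []
    for node, seen in last_heartbeat.items():
        if seen is None or now - seen > quiet_window:
            idle.append(node)
    return sorted(idle)


# Hypothetical sample data: node names and timestamps are made up.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
heartbeats = {
    "node_0": now - timedelta(minutes=2),   # still reporting progress
    "node_1": now - timedelta(minutes=45),  # silent for a while
    "node_2": None,                         # never picked up a task
}
print(nodes_safe_to_scale_in(heartbeats, now))  # ['node_1', 'node_2']
```

For this to work, a running task has to log more often than the quiet window, otherwise a busy node could look idle.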

There is a good article that explains how to "Scale a Service Fabric cluster programmatically".
Here is another approach, which requires a bit of coding: automate manual scaling.
- Develop another service, either as part of the SF application or as a VM extension. The point is to have this service running on every node in the cluster so that it can track the status of task execution.
- There are well-defined steps for manually removing an SF node from the cluster:
  - Run Disable-ServiceFabricNode with intent ‘RemoveNode’ to disable the node you’re going to remove (the highest instance in that node type).
  - Run Get-ServiceFabricNode to make sure that the node has indeed transitioned to disabled. If not, wait until the node is disabled. You cannot hurry this step.
  - Follow the sample/instructions in the quick start template gallery to reduce the number of VMs in that node type by one. The instance removed is the highest VM instance.
  - And so forth. Find more info in "Scale a Service Fabric cluster in or out using auto-scale rules". The takeaway here is that these steps can be automated.
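
To illustrate how the manual steps above could be chained together, here is a minimal Python sketch. It does not call Service Fabric itself: the three callables are hypothetical wrappers you would have to supply (for example around Disable-ServiceFabricNode, Get-ServiceFabricNode and whatever you use to lower the VM count), and the node name is made up.

```python
import time


def remove_highest_node(disable_node, get_node_status, shrink_by_one,
                        node_name, poll_seconds=30, sleep=time.sleep):
    """Disable a node, wait until it reports Disabled, then shrink the node type.

    The callables are hypothetical wrappers supplied by the caller:
      disable_node(name)    -> e.g. Disable-ServiceFabricNode with intent RemoveNode
      get_node_status(name) -> e.g. the node status read via Get-ServiceFabricNode
      shrink_by_one()       -> e.g. lower the VM count of the node type by one
    """
    disable_node(node_name)
    # Disabling cannot be hurried, so poll until the node is really disabled.
    while get_node_status(node_name) != "Disabled":
        sleep(poll_seconds)
    shrink_by_one()


# Dry run with fakes instead of a real cluster.
calls = []
statuses = iter(["Disabling", "Disabling", "Disabled"])
remove_highest_node(
    disable_node=lambda name: calls.append(("disable", name)),
    get_node_status=lambda name: next(statuses),
    shrink_by_one=lambda: calls.append(("shrink",)),
    node_name="_nt1vm_4",  # hypothetical node name
    sleep=lambda seconds: calls.append(("wait", seconds)),
)
print(calls)
```

Passing the operations in as callables keeps the ordering logic testable without a cluster. A real version would probably also want a timeout on the wait loop.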

Implement the scaling logic in a new service that monitors which nodes have finished their tasks and are sitting idle, and scales them in using the steps described above.
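
As a rough idea of what the decision part of such a service might look like, here is a small Python sketch. It assumes the service already knows how many tasks are running on each instance. The function name, the input shape and the `min_nodes` floor are assumptions made for illustration, and only the highest instance is considered because that is the one removed when the VM count goes down by one.

```python
def pick_scale_in_candidate(active_tasks, min_nodes=1):
    """Return the highest instance index if it is idle, else None.

    active_tasks maps an instance index to its number of in-progress tasks.
    min_nodes is an assumed lower bound on the cluster size.
    """
    if len(active_tasks) <= min_nodes:
        return None  # already at the minimum size
    highest = max(active_tasks)
    return highest if active_tasks[highest] == 0 else None


print(pick_scale_in_candidate({0: 1, 1: 0, 2: 0}))  # 2
print(pick_scale_in_candidate({0: 0, 1: 0, 2: 3}))  # None
```

When the highest instance is still busy, this sketch simply returns None and the service would check again later.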

Hopefully it makes sense.

Thanks a lot to @tank104 for the help on elaborating my answer!