
How to change the OS on an existing Service Fabric cluster?


I'm trying to change my VMSS from:

    "imageReference": {
      "publisher": "MicrosoftWindowsServer",
      "offer": "WindowsServer",
      "sku": "2016-Datacenter-with-Containers",
      "version": "latest"
    }

To:

    "imageReference": {
      "publisher": "MicrosoftWindowsServer",
      "offer": "WindowsServerSemiAnnual",
      "sku": "Datacenter-Core-1803-with-Containers-smalldisk",
      "version": "latest"
    }

The first thing I tried was:

    Update-AzureRmVmss -ResourceGroupName "DevServiceFabric" -VMScaleSetName "HTTP" `
        -ImageReferenceSku Datacenter-Core-1803-with-Containers-smalldisk `
        -ImageReferenceOffer WindowsServerSemiAnnual

Which gives me the error:

    Update-AzureRmVmss : Changing property 'imageReference.offer' is not allowed. ErrorCode: PropertyChangeNotAllowed

This is confirmed in the docs: you can only set the offer when the scale set is created.

Next I tried Add-AzureRmServiceFabricNodeType to add a new node type, thinking I could delete the old one afterwards. However, this command doesn't seem to let you set the OS image, only the VM SKU. In other words, all VMs in your cluster have to run the same OS.

Is there a way to change this without deleting the entire cluster and starting from scratch?


Solution

  • Edit: If you can stay within the current publisher and offer, you can switch the OS simply by changing the SKU. See the answer by Mike.
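
A minimal sketch of that SKU-only change, using the Az module equivalents of the cmdlets in the question. The resource group and scale set names are taken from the question; `$location` and `$newSku` are placeholders that are not in the original and must be filled in for your environment.

```powershell
# Names from the question; $location and $newSku are assumptions to replace.
$rg       = "DevServiceFabric"
$vmss     = "HTTP"
$location = "<your cluster region>"
$newSku   = "<target SKU under the same offer>"

# Inspect the image reference the scale set currently uses.
$current = (Get-AzVmss -ResourceGroupName $rg -VMScaleSetName $vmss).VirtualMachineProfile.StorageProfile.ImageReference
$current

# List the SKUs available under the current publisher and offer.
Get-AzVMImageSku -Location $location -PublisherName $current.Publisher -Offer $current.Offer |
    Select-Object -ExpandProperty Skus

# Change only the SKU; publisher and offer stay untouched, so the
# PropertyChangeNotAllowed error from the question should not apply.
Update-AzVmss -ResourceGroupName $rg -VMScaleSetName $vmss -ImageReferenceSku $newSku
```

Depending on the scale set's upgrade policy, existing instances may not pick up the new image until they are upgraded to the latest model, so check the cluster health before and after.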


    If you really need to change the offer, you can follow the procedure described in the Service Fabric documentation:

    "Upgrade the size and operating system of the primary node type VMs."

    Be aware that there is a lot to consider, such as your availability level. The cluster will also be unreachable from the outside for a brief period while the load balancer is switched over.

    Shortened drastically:

    • add a second scale set with your desired OS to the primary node type
    • disable the nodes in the old scale set, then remove the scale set
    • switch over the load balancer by moving the old public IP's DNS settings to the new public IP

    # Variables.
    $groupname = "sfupgradetestgroup"
    $clusterloc="southcentralus"  
    $subscriptionID="<your subscription ID>"
    
    # Sign in to your Azure account and select your subscription.
    Connect-AzAccount -Subscription $subscriptionID
    
    # Create a new resource group for your deployment and give it a name and a location.
    New-AzResourceGroup -Name $groupname -Location $clusterloc
    
    # Deploy the two node type cluster.
    New-AzResourceGroupDeployment -ResourceGroupName $groupname -TemplateParameterFile "C:\temp\cluster\Deploy-2NodeTypes-2ScaleSets.parameters.json" `
        -TemplateFile "C:\temp\cluster\Deploy-2NodeTypes-2ScaleSets.json" -Verbose
    
    # Connect to the cluster and check the cluster health.
    $ClusterName= "sfupgradetest.southcentralus.cloudapp.azure.com:19000"
    $thumb="F361720F4BD5449F6F083DDE99DC51A86985B25B"
    
    Connect-ServiceFabricCluster -ConnectionEndpoint $ClusterName -KeepAliveIntervalInSec 10 `
        -X509Credential `
        -ServerCertThumbprint $thumb  `
        -FindType FindByThumbprint `
        -FindValue $thumb `
        -StoreLocation CurrentUser `
        -StoreName My 
    
    Get-ServiceFabricClusterHealth
    
    # Deploy a new scale set into the primary node type.  Create a new load balancer and public IP address for the new scale set.
    New-AzResourceGroupDeployment -ResourceGroupName $groupname -TemplateParameterFile "C:\temp\cluster\Deploy-2NodeTypes-3ScaleSets.parameters.json" `
        -TemplateFile "C:\temp\cluster\Deploy-2NodeTypes-3ScaleSets.json" -Verbose
    
    # Check the cluster health again. All 15 nodes should be healthy.
    Get-ServiceFabricClusterHealth
    
    # Disable the nodes in the original scale set.
    $nodeNames = @("_NTvm1_0","_NTvm1_1","_NTvm1_2","_NTvm1_3","_NTvm1_4")
    
    Write-Host "Disabling nodes..."
    foreach($name in $nodeNames){
        Disable-ServiceFabricNode -NodeName $name -Intent RemoveNode -Force
    }
    
    Write-Host "Checking node status..."
    foreach($name in $nodeNames){
        # Poll every 5 seconds, up to 50 times, until deactivation completes.
        $loopTimeout = 50
    
        do{
            Start-Sleep 5
            $loopTimeout -= 1
            $state = Get-ServiceFabricNode -NodeName $name
            Write-Host "$name state: " $state.NodeDeactivationInfo.Status
        }
        while (($state.NodeDeactivationInfo.Status -ne "Completed") -and ($loopTimeout -ne 0))
    
        # Stop before touching the scale set if the node did not reach the Disabled state.
        if ($state.NodeStatus -ne [System.Fabric.Query.NodeStatus]::Disabled)
        {
            Write-Error "$name node deactivation failed with state $($state.NodeStatus)"
            exit
        }
    }
    
    # Remove the scale set
    $scaleSetName="NTvm1"
    Remove-AzVmss -ResourceGroupName $groupname -VMScaleSetName $scaleSetName -Force
    Write-Host "Removed scale set $scaleSetName"
    
    $lbname="LB-sfupgradetest-NTvm1"
    $oldPublicIpName="PublicIP-LB-FE-0"
    $newPublicIpName="PublicIP-LB-FE-2"
    
    # Store DNS settings of public IP address related to old Primary NodeType into variable 
    $oldprimaryPublicIP = Get-AzPublicIpAddress -Name $oldPublicIpName  -ResourceGroupName $groupname
    
    $primaryDNSName = $oldprimaryPublicIP.DnsSettings.DomainNameLabel
    
    $primaryDNSFqdn = $oldprimaryPublicIP.DnsSettings.Fqdn
    
    # Remove Load Balancer related to old Primary NodeType. This will cause a brief period of downtime for the cluster
    Remove-AzLoadBalancer -Name $lbname -ResourceGroupName $groupname -Force
    
    # Remove the old public IP
    Remove-AzPublicIpAddress -Name $oldPublicIpName -ResourceGroupName $groupname -Force
    
    # Replace DNS settings of Public IP address related to new Primary Node Type with DNS settings of Public IP address related to old Primary Node Type
    $PublicIP = Get-AzPublicIpAddress -Name $newPublicIpName  -ResourceGroupName $groupname
    $PublicIP.DnsSettings.DomainNameLabel = $primaryDNSName
    $PublicIP.DnsSettings.Fqdn = $primaryDNSFqdn
    Set-AzPublicIpAddress -PublicIpAddress $PublicIP
    
    # Check the cluster health
    Get-ServiceFabricClusterHealth
    
    # Remove node state for the deleted nodes.
    foreach($name in $nodeNames){
        # Remove the node from the cluster
        Remove-ServiceFabricNodeState -NodeName $name -TimeoutSec 300 -Force
        Write-Host "Removed node state for node $name"
    }
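
The script above hard-codes five node names, which only matches a scale set with exactly five instances. As a hedged sketch, the list could instead be derived from the cluster itself. This assumes an open `Connect-ServiceFabricCluster` session, assumes the node names follow the `_<scale set name>_<index>` pattern seen in the script, and has to run before the nodes are disabled and the scale set is removed.

```powershell
# Assumption: the old scale set is named "NTvm1", as in the script above.
$scaleSetName = "NTvm1"

# Collect every node whose name starts with "_NTvm1_".
$nodeNames = Get-ServiceFabricNode |
    Where-Object { $_.NodeName -like "_${scaleSetName}_*" } |
    Select-Object -ExpandProperty NodeName

Write-Host "Nodes to disable: $($nodeNames -join ', ')"

# After the whole procedure, list the remaining nodes to confirm that
# only nodes from the new scale set are left and that they are healthy.
Get-ServiceFabricNode | Format-Table NodeName, NodeStatus, HealthState
```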