Tags: azure, rhel, azure-resource-manager, nat, outbound

Not getting any external outbound traffic from nodes in an Azure VM Scale Set behind a load balancer


I'm having trouble accessing external resources from the RHEL nodes in a VM Scale Set.

To sketch the environment I'm trying to describe using Azure Resource Manager Templates, I'm looking to create:

  • 1 common virtualNetwork
  • 1 frontend VM (running RHEL; works as intended)
  • 1 cluster (VMSS) running 2 nodes (RHEL)
    • The nodes are spawned in the same private subnet as the frontend VM
    • 1 load balancer that should act as a NAT gateway (but doesn't)
      • The load balancer has an external IP, an inboundNatPool (which works) and a backendAddressPool (in which the nodes register successfully)
    • The network security group manages port access (set to allow all outbound connections)
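The shared virtual network isn't shown in the snippets. A minimal sketch of a VNet with one private subnet used by both the frontend VM and the scale-set nodes, in the same YAML style; the subnetName variable and the 10.0.x.x address prefixes are hypothetical placeholders, not values from the original template:

```yaml
#   Shared virtual network (sketch)
#   -------------------------------
- type: Microsoft.Network/virtualNetworks
  name: '[variables(''vnetName'')]'
  apiVersion: '2017-10-01'
  location: '[resourceGroup().location]'
  properties:
    addressSpace:
      addressPrefixes:
      - 10.0.0.0/16                        # hypothetical address range
    subnets:
    - name: '[variables(''subnetName'')]'  # hypothetical variable
      properties:
        addressPrefix: 10.0.0.0/24         # shared by the frontend VM and the VMSS nodes
```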

As a side note, I'm used to writing AWS CloudFormation templates in YAML, so I handle Azure Resource Manager templates the same way, for readability and so I can add comments to my templates.
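An ARM template written in YAML keeps the same top-level keys as its JSON form; only the comments disappear when it is converted. A minimal skeleton, assuming the two parameters the snippets below reference (sshUserName, sshPublicKey); the parameter types are an assumption:

```yaml
# Minimal ARM template skeleton in YAML (converted to JSON before deployment).
# YAML comments like these are dropped in the JSON output.
$schema: 'https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#'
contentVersion: '1.0.0.0'
parameters:
  sshUserName:
    type: string          # assumed type
  sshPublicKey:
    type: string          # assumed type; a public key is not secret
variables: {}             # name and resource-ID variables go here
resources: []             # VNet, NSG, load balancer and scale set go here
```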

An example of my VMSS config (short snippet):

... #(the YAML template is first converted to JSON and then deployed using the Azure CLI)
#   Cluster
#   -------
#     Scale Set
#     ---------
#       | VM Scale Set cannot connect to external sources
#       |
- type: Microsoft.Compute/virtualMachineScaleSets
  name: '[variables(''vmssName'')]'
  location: '[resourceGroup().location]'
  apiVersion: '2017-12-01'
  dependsOn:
  - '[variables(''vnetName'')]'
  - '[variables(''loadBalancerName'')]'
  - '[variables(''networkSecurityGroupName'')]'
  sku:
    capacity: '[variables(''instanceCount'')]' # Number of nodes to spawn
    name: Standard_A2_v2
    tier: Standard
  # zones: # If zone is specified, no sku can be chosen
  # - '1'
  properties:
    overprovision: true  # boolean, not the string 'true'
    upgradePolicy:
      mode: Manual
    virtualMachineProfile:
      networkProfile:
        networkInterfaceConfigurations:
        - name: '[variables(''vmssNicName'')]'
          properties:
            ipConfigurations:
            - name: '[variables(''ipConfigName'')]'
              properties:
                loadBalancerBackendAddressPools:
                - id: '[variables(''lbBackendAddressPoolsId'')]'
                loadBalancerInboundNatPools:
                - id: '[variables(''lbInboundNatPoolsId'')]'
                subnet:
                  id: '[variables(''subnetId'')]'
            primary: true
            networkSecurityGroup:
              id: '[variables(''networkSecurityGroupId'')]'
      osProfile:
        computerNamePrefix: '[variables(''vmssName'')]'
        adminUsername: '[parameters(''sshUserName'')]'
        # adminPassword: '[parameters(''adminPassword'')]'
        linuxConfiguration:
          disablePasswordAuthentication: True
          ssh:
            publicKeys:
            - keyData: '[parameters(''sshPublicKey'')]'
              path: '[concat(''/home/'',parameters(''sshUserName''),''/.ssh/authorized_keys'')]'
      storageProfile:
        imageReference: '[variables(''clusterImageReference'')]'
        osDisk:
          caching: ReadWrite
          createOption: FromImage
...
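The snippets in this post reference several ID variables (subnetId, lbBackendAddressPoolsId, lbInboundNatPoolsId, networkSecurityGroupId, frontEndIPConfigID, natIPAddressId, lbProbeId) whose definitions aren't shown. One way to build them is with the resourceId() template function. A sketch, in which every name value (clusterVnet, clusterLb, and so on) and the subnetName variable are hypothetical:

```yaml
variables:
  # Hypothetical resource names; substitute your own
  vnetName: clusterVnet
  subnetName: clusterSubnet
  loadBalancerName: clusterLb
  lbBackendPoolName: clusterBackendPool
  lbNatPoolName: clusterNatPool
  lbProbeName: clusterProbe
  natIPAddressName: clusterNatIp
  networkSecurityGroupName: clusterNsg
  # Resource IDs derived from the names above (child resources list parent name first)
  subnetId: '[resourceId(''Microsoft.Network/virtualNetworks/subnets'', variables(''vnetName''), variables(''subnetName''))]'
  lbBackendAddressPoolsId: '[resourceId(''Microsoft.Network/loadBalancers/backendAddressPools'', variables(''loadBalancerName''), variables(''lbBackendPoolName''))]'
  lbInboundNatPoolsId: '[resourceId(''Microsoft.Network/loadBalancers/inboundNatPools'', variables(''loadBalancerName''), variables(''lbNatPoolName''))]'
  frontEndIPConfigID: '[resourceId(''Microsoft.Network/loadBalancers/frontendIPConfigurations'', variables(''loadBalancerName''), ''LoadBalancerFrontEnd'')]'
  lbProbeId: '[resourceId(''Microsoft.Network/loadBalancers/probes'', variables(''loadBalancerName''), variables(''lbProbeName''))]'
  natIPAddressId: '[resourceId(''Microsoft.Network/publicIPAddresses'', variables(''natIPAddressName''))]'
  networkSecurityGroupId: '[resourceId(''Microsoft.Network/networkSecurityGroups'', variables(''networkSecurityGroupName''))]'
```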

The Network Security Group referenced from the template above is:

#     NetworkSecurityGroup
#     --------------------
- type: Microsoft.Network/networkSecurityGroups
  name: '[variables(''networkSecurityGroupName'')]'
  apiVersion: '2017-10-01'
  location: '[resourceGroup().location]'
  properties:
    securityRules:
    - name: remoteConnection
      properties:
        priority: 101
        access: Allow
        direction: Inbound
        protocol: Tcp
        description: Allow SSH traffic
        sourceAddressPrefix: '*'
        sourcePortRange: '*'
        destinationAddressPrefix: '*'
        destinationPortRange: '22'
    - name: allow_outbound_connections
      properties:
        description: This rule allows outbound connections
        priority: 200
        access: Allow
        direction: Outbound
        protocol: '*'
        sourceAddressPrefix: 'VirtualNetwork'
        sourcePortRange: '*'
        destinationAddressPrefix: '*'
        destinationPortRange: '*'

And the load balancer, which is where I assume the error is, is defined as:

#   Loadbalancer as NatGateway
#   --------------------------
- type: Microsoft.Network/loadBalancers
  name: '[variables(''loadBalancerName'')]'
  apiVersion: '2017-10-01'
  location: '[resourceGroup().location]'
  sku:
    name: Standard
  dependsOn:
  - '[variables(''natIPAddressName'')]'
  properties:
    backendAddressPools:
    - name: '[variables(''lbBackendPoolName'')]'
    frontendIPConfigurations:
    - name: LoadBalancerFrontEnd
      properties:
        publicIPAddress:
          id: '[variables(''natIPAddressId'')]'
    inboundNatPools:
    - name: '[variables(''lbNatPoolName'')]'
      properties:
        backendPort: '22'
        frontendIPConfiguration:
          id: '[variables(''frontEndIPConfigID'')]'
        frontendPortRangeStart: '50000'
        frontendPortRangeEnd: '50099'
        protocol: tcp
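The load balancer's frontend points at natIPAddressId, but the public IP resource itself isn't shown. A Standard SKU load balancer needs a Standard SKU public IP, and Standard public IPs must use static allocation. A sketch of that resource, assuming the natIPAddressName variable from the template above:

```yaml
#   Public IP for the load balancer frontend (sketch)
#   -------------------------------------------------
- type: Microsoft.Network/publicIPAddresses
  name: '[variables(''natIPAddressName'')]'
  apiVersion: '2017-10-01'
  location: '[resourceGroup().location]'
  sku:
    name: Standard                   # must match the load balancer's Standard SKU
  properties:
    publicIPAllocationMethod: Static # required for Standard SKU public IPs
```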

I keep reading articles about configuring SNAT with port masquerading, but I can't find relevant examples of such a setup.

Any help is greatly appreciated.


Solution

  • It took a lot of searching, but the Azure article on Azure Load Balancer outbound connections (Scenario #2) states that a load-balancing rule (plus a complementary health probe) is necessary for SNAT to function.

    The new code for the load balancer became:

    ...
    - type: Microsoft.Network/loadBalancers
      name: '[variables(''loadBalancerName'')]'
      apiVersion: '2017-10-01'
      location: '[resourceGroup().location]'
      sku:
        name: Standard
      dependsOn:
      - '[variables(''natIPAddressName'')]'
      properties:
        backendAddressPools:
        - name: '[variables(''lbBackendPoolName'')]'
        frontendIPConfigurations:
        - name: LoadBalancerFrontEnd
          properties:
            publicIPAddress:
              id: '[variables(''natIPAddressId'')]'
        inboundNatPools:  # Unchanged from the original; the scale set still references this pool
        - name: '[variables(''lbNatPoolName'')]'
          properties:
            backendPort: '22'
            frontendIPConfiguration:
              id: '[variables(''frontEndIPConfigID'')]'
            frontendPortRangeStart: '50000'
            frontendPortRangeEnd: '50099'
            protocol: tcp
        probes:  # Needed for the load-balancing rule to work
        - name: '[variables(''lbProbeName'')]'
          properties:
            protocol: Tcp
            port: 22  # probe SSH, which every node runs
            intervalInSeconds: 5
            numberOfProbes: 2
        loadBalancingRules:  # Needed for outbound SNAT to work
        - name: '[concat(variables(''loadBalancerName''),''NatRule'')]'
          properties:
            disableOutboundSnat: false  # keep outbound SNAT enabled for the backend pool
            frontendIPConfiguration:
              id: '[variables(''frontEndIPConfigID'')]'
            backendAddressPool:
              id: '[variables(''lbBackendAddressPoolsId'')]'
            probe:
              id: '[variables(''lbProbeId'')]'
            protocol: Tcp
            frontendPort: 80
            backendPort: 80
    ...
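For reference: newer Microsoft.Network API versions (2018-07-01 and later) let a Standard load balancer declare SNAT explicitly through outboundRules, so a placeholder load-balancing rule and probe aren't required just to get outbound connectivity. A sketch reusing the variable names from the templates above; the rule name, port allocation and idle timeout are assumed values, so treat this as a starting point rather than a drop-in replacement:

```yaml
- type: Microsoft.Network/loadBalancers
  name: '[variables(''loadBalancerName'')]'
  apiVersion: '2018-07-01'   # outboundRules need 2018-07-01 or later
  location: '[resourceGroup().location]'
  sku:
    name: Standard
  dependsOn:
  - '[variables(''natIPAddressName'')]'
  properties:
    backendAddressPools:
    - name: '[variables(''lbBackendPoolName'')]'
    frontendIPConfigurations:
    - name: LoadBalancerFrontEnd
      properties:
        publicIPAddress:
          id: '[variables(''natIPAddressId'')]'
    inboundNatPools:           # SSH access, as in the question
    - name: '[variables(''lbNatPoolName'')]'
      properties:
        backendPort: 22
        frontendIPConfiguration:
          id: '[variables(''frontEndIPConfigID'')]'
        frontendPortRangeStart: 50000
        frontendPortRangeEnd: 50099
        protocol: Tcp
    # Explicit outbound SNAT for every node in the backend pool.
    # If you also keep a load-balancing rule on this frontend, set
    # disableOutboundSnat: true on it so SNAT comes only from this rule.
    outboundRules:
    - name: '[concat(variables(''loadBalancerName''), ''OutboundRule'')]'  # hypothetical name
      properties:
        protocol: All                  # Tcp, Udp or All
        frontendIPConfigurations:
        - id: '[variables(''frontEndIPConfigID'')]'
        backendAddressPool:
          id: '[variables(''lbBackendAddressPoolsId'')]'
        allocatedOutboundPorts: 1024   # SNAT ports per instance; must be a multiple of 8
        idleTimeoutInMinutes: 4
```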