
Retry task on a Windows node if unreachable


Is there a way to retry a task if the Windows node is temporarily unreachable?

For example, I tried

- name: Hello
  ansible.windows.win_powershell:
    script: | 
      Write-Host "hello"
  register: _status
  until: _status is not unreachable
  retries: 3
  delay: 200

But after 30 seconds I got:

fatal: [mylocalwin]: UNREACHABLE! => changed=false 
  msg: 'certificate: HTTPSConnectionPool(host=''xxx.xxx.xxx.xxx'', port=5986): Max retries exceeded with url: /wsman (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f4160b63eb0>, ''Connection to xxx.xxx.xxx.xxx timed out. (connect timeout=30)''))'
  unreachable: true

I would like to retry three times before failing.


Solution

  • Here is my solution, based on https://github.com/ansible/ansible/issues/25532#issuecomment-428386816

    Modify

    /lib/python3.10/site-packages/winrm/protocol.py

    class Protocol(object):
        def __init__(
                ...
                reconnection_retries=0,
                reconnection_backoff_factor=2.0
            ):
            ...
            
            self.transport = Transport(
                ...      
                reconnection_retries=reconnection_retries,
                reconnection_backoff_factor=reconnection_backoff_factor
            )
    

    /lib/python3.10/site-packages/winrm/transport.py

    class Transport(object):
        def __init__(
            ...
            reconnection_retries=0,
            reconnection_backoff_factor=2.0):
            
            ...
            self.reconnection_retries = reconnection_retries
            self.reconnection_backoff_factor = reconnection_backoff_factor
            ...
            
        def build_session(self):
            ...
            
            # Merge proxy environment variables
            settings = session.merge_environment_settings(url=self.endpoint,
                          proxies=proxies, stream=None, verify=None, cert=None)
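            # ADDED: retry failed connections with exponential backoff.
            # Connection errors (e.g. connect timeouts) are retried for any HTTP method.
            # Note: urllib3 applies status_forcelist retries only to methods in Retry's
            # allowed methods, which by default exclude POST (the method WinRM uses).
            retries = requests.packages.urllib3.util.retry.Retry(total=self.reconnection_retries,
                                                                 connect=self.reconnection_retries,
                                                                 status=self.reconnection_retries,
                                                                 read=0,
                                                                 backoff_factor=self.reconnection_backoff_factor,
                                                                 status_forcelist=(413, 425, 429, 503))
            # ADDED: use the retrying adapter for both HTTP and HTTPS endpoints
            session.mount('http://', requests.adapters.HTTPAdapter(max_retries=retries))
            session.mount('https://', requests.adapters.HTTPAdapter(max_retries=retries))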
            ...      
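As a rough illustration of what reconnection_backoff_factor does, the sketch below mimics how urllib3's Retry computes the sleep before each retry: no sleep before the first retry, then backoff_factor * 2 ** (n - 1) before the n-th consecutive retry, capped at 120 s. It assumes no jitter and no Retry-After header, and backoff_schedule is a hypothetical helper, not part of pywinrm or urllib3.

```python
def backoff_schedule(retries: int, backoff_factor: float = 2.0,
                     backoff_max: float = 120.0) -> list[float]:
    """Seconds slept before each retry, mimicking urllib3's Retry backoff.

    Assumption: no jitter and no Retry-After header. urllib3 sleeps 0 s
    before the first retry and backoff_factor * 2 ** (n - 1) before the
    n-th consecutive retry, capped at backoff_max.
    """
    sleeps = []
    for n in range(1, retries + 1):
        if n == 1:
            sleeps.append(0.0)  # no backoff before the first retry
        else:
            sleeps.append(min(backoff_max, backoff_factor * 2 ** (n - 1)))
    return sleeps


print(backoff_schedule(4))  # [0.0, 4.0, 8.0, 16.0]
```

With reconnection_retries: 4 and the default factor of 2.0, the patched transport therefore waits about 28 s in total between attempts, on top of the connect timeouts themselves.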
    

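    Now the retries can be controlled when the node is unreachable. This works because Ansible's winrm connection plugin passes every ansible_winrm_<name> variable whose <name> matches an argument of winrm.Protocol.__init__ on to Protocol, so the two new parameters can be set as variables: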

    - name: Test
      hosts: mylocalwin
      gather_facts: false
      vars:
        ansible_winrm_reconnection_backoff_factor: 2.0
        ansible_winrm_reconnection_retries: 4
    
      tasks:
        - name: Hello
          ansible.windows.win_powershell:
            script: | 
              Write-Host "hello"
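The variables above take effect because Ansible's winrm connection plugin inspects the arguments of winrm.Protocol.__init__ and forwards each ansible_winrm_<name> variable whose <name> matches one of them (unsupported ones only produce a warning). The sketch below is a simplified, hypothetical re-creation of that filtering with inspect; Protocol here is a stand-in with the patched signature, not the real pywinrm class, and winrm_kwargs_from_vars is not Ansible's actual code.

```python
import inspect


class Protocol:
    """Hypothetical stand-in with the patched winrm.Protocol signature."""

    def __init__(self, endpoint, transport="plaintext", username=None,
                 password=None, reconnection_retries=0,
                 reconnection_backoff_factor=2.0):
        self.endpoint = endpoint
        self.transport = transport
        self.reconnection_retries = reconnection_retries
        self.reconnection_backoff_factor = reconnection_backoff_factor


def winrm_kwargs_from_vars(host_vars, protocol_cls=Protocol):
    """Keep only ansible_winrm_<name> vars whose <name> is a Protocol argument."""
    prefix = "ansible_winrm_"
    accepted = set(inspect.signature(protocol_cls.__init__).parameters) - {"self"}
    kwargs = {}
    for key, value in host_vars.items():
        if key.startswith(prefix) and key[len(prefix):] in accepted:
            kwargs[key[len(prefix):]] = value
    return kwargs


host_vars = {
    "ansible_winrm_reconnection_retries": 4,
    "ansible_winrm_reconnection_backoff_factor": 2.0,
    "ansible_winrm_not_a_protocol_arg": True,  # skipped (Ansible itself would warn)
    "ansible_user": "Administrator",           # no ansible_winrm_ prefix
}
kwargs = winrm_kwargs_from_vars(host_vars)
protocol = Protocol("https://192.0.2.10:5986/wsman", **kwargs)
print(kwargs)
# {'reconnection_retries': 4, 'reconnection_backoff_factor': 2.0}
```

This is also why the two variables have no effect on an unpatched pywinrm: Protocol.__init__ does not accept them, so they are never forwarded.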
    

    I checked the solution with tcpdump and can confirm that the group of TCP SYN packets is re-sent reconnection_retries additional times.

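    Here is a short recap of the measured timings (RETRY_N_BACKOFF_2 means reconnection_retries: N with reconnection_backoff_factor: 2.0; NO_RETRY_MECHANISM is without the patch):

    CONFIGURATION        TIME TO UNREACHABLE (s)   TCP SYN PACKETS SENT
    RETRY_0_BACKOFF_2    30                        5
    RETRY_1_BACKOFF_2    60                        10
    RETRY_2_BACKOFF_2    94                        15
    RETRY_3_BACKOFF_2    133                       20
    RETRY_4_BACKOFF_2    179                       25
    RETRY_5_BACKOFF_2    240                       30
    NO_RETRY_MECHANISM   30                        5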
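The measured numbers fit a simple model: each attempt waits the full 30 s connect timeout, urllib3 adds backoff sleeps between attempts (0 s, then 4, 8, 16, 32 s with factor 2.0), and, assuming a Linux controller with the default 1 s initial SYN retransmission timeout that doubles each time, a SYN goes out at roughly 0, 1, 3, 7 and 15 s within each attempt. The sketch below is a back-of-the-envelope estimate under those assumptions; expected_detection_time and syns_per_attempt are hypothetical helpers.

```python
def expected_detection_time(retries, backoff_factor=2.0,
                            connect_timeout=30.0, backoff_max=120.0):
    """Estimated seconds until the host is reported UNREACHABLE.

    Assumptions: every attempt waits the full connect timeout, and urllib3
    sleeps 0 s before the first retry and
    min(backoff_max, backoff_factor * 2 ** (n - 1)) before the n-th one.
    """
    attempts = retries + 1
    backoff = sum(min(backoff_max, backoff_factor * 2 ** (n - 1))
                  for n in range(2, retries + 1))
    return attempts * connect_timeout + backoff


def syns_per_attempt(connect_timeout=30.0, initial_rto=1.0):
    """SYNs sent before the connect timeout, assuming the RTO doubles each time."""
    t, rto, count = 0.0, initial_rto, 0
    while t < connect_timeout:
        count += 1  # a SYN (or retransmission) is sent at time t
        t += rto    # the next retransmission follows after the current RTO
        rto *= 2    # exponential backoff of the SYN retransmission timer
    return count


for retries in range(6):
    print(f"RETRY_{retries}_BACKOFF_2  "
          f"{expected_detection_time(retries):4.0f} s  "
          f"{(retries + 1) * syns_per_attempt()} SYNs")
# Estimated times: 30, 60, 94, 132, 178 and 240 s; SYNs: 5, 10, ..., 30
```

Every estimate lands within about a second of the measured values, which suggests the extra time comes almost entirely from the repeated 30 s connect timeouts plus urllib3's backoff sleeps.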