python, network-programming, paramiko, esxi

How to determine if a remote ESXi host has fully booted?


I am writing a Python script to fully boot a handful of ESXi hosts remotely, and I am having trouble determining when ESXi has finished booting and is ready to receive commands sent over SSH. I am running the script on a Windows host that is hardwired to each ESXi host, and the system is air-gapped, so there are no firewalls in the way and no security software that could interfere.

Currently I am doing this: I connect to the chassis through SSH and send the power commands to the ESXi host. This works and has always worked. Then I attempt to SSH into each blade and send the following command:

esxcli system stats uptime get

The command doesn't matter; I just need a response to confirm that the host is up. Below is the function I am using to send the SSH commands in hopes of getting a response:

import socket
import time

import paramiko


def send_command(ip, port, timeout, retry_interval, cmd, user, password):
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    retry_interval = float(retry_interval)
    timeout = int(timeout)
    timeout_start = time.time()  # recorded, but the loop below never checks it
    while True:
        time.sleep(retry_interval)
        try:
            ssh.connect(ip, port, user, password, timeout=5)
            stdin, stdout, stderr = ssh.exec_command(cmd)
            resp = ''.join(stdout.readlines())
            print(resp)
            # any output at all is treated as success
            return resp
        except paramiko.ssh_exception.NoValidConnectionsError as e:
            # no address for the host accepted the connection
            print(e)
        except paramiko.ssh_exception.SSHException as e:
            # socket is open, but no SSH service responded
            print(e)
        except socket.timeout as e:
            print(e)
        except (socket.error, IOError, TimeoutError) as e:
            print(e)
        except Exception as e:
            # last resort, so that nothing unexpected breaks the loop
            print(e)

My goal here is to catch all of the exceptions long enough for the host to finish booting and then receive a response. The issue is that sometimes it will loop for several minutes (as expected when booting a system like this), and then it will print

IO error: [Errno 111] Connection refused

It then drops out of the function and its try/except block and never establishes the connection. I know this is a fault in my exception handling because, when this happens, I stop the script, wait a few minutes, and run it again without touching anything else, and the esxcli command works perfectly and so does the rest of the script.

How do I prevent the Errno 111 error from breaking my loop? Any help is greatly appreciated.

Edit: One possible duct-tape solution could be changing the command to "esxcli system hostname get" and checking the response for the word "Domain". This might work because the IOError seems to be a response and not an exception. I'll have to wait until Monday to test that solution, though.
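
The `timeout` argument of `send_command` is converted to an integer but never checked, so the loop has no upper bound. Below is a minimal sketch of a deadline-bounded retry, not the asker's code: `retry_until_deadline` and the `attempt` callable are hypothetical names, and `attempt` is assumed to perform one connect-and-run cycle and return the command output.

```python
import time


def retry_until_deadline(attempt, timeout, retry_interval, retry_on=(OSError,)):
    """Call attempt() until it returns without raising or the deadline passes.

    attempt: hypothetical zero-argument callable that performs one
             connect-and-run cycle and returns the command output.
    Returns that output, or None if `timeout` seconds elapse first.
    """
    deadline = time.monotonic() + float(timeout)
    while True:
        try:
            return attempt()
        except retry_on as exc:
            # Expected while the host is still booting: report and retry.
            print(f"not ready yet: {exc!r}")
        if time.monotonic() >= deadline:
            return None
        time.sleep(float(retry_interval))
```

With paramiko, passing `retry_on=(OSError, paramiko.SSHException)` should cover the connection errors listed in the question, since `socket.error`, `socket.timeout`, `IOError`, and `TimeoutError` are all `OSError` in Python 3. Note that this only handles failures raised as exceptions; it does not inspect the text the command returns.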


Solution

  • I solved it. It occurred to me that I was already handling every exception that Python code could possibly throw, so my defect wasn't a Python error. That also explains why I wasn't finding anything online about the relationship between Python, SSH, and the Errno 111 error.

    The printout is in fact a response from the ESXi host, and my code accepts any response as success. So I simply changed the esxcli command from requesting the uptime to

    esxcli system hostname get

    and then added this check to the function, right after the response is read:

    substring = "Domain"
    if substring not in resp: 
        print(resp)
        continue
    

    I am looking for the word "Domain" because that must be there if that call is successful.

    How I figured it out: I installed ESXi 7 on an old Intel NUC, turned on SSH in the kickstart script, started the Python script, and then powered on the NUC. I used the NUC because a fresh install on simple hardware boots much faster and more quietly than Dell blades! I also wrapped the resp variable in a print(type(resp)) call and was able to determine that it was in fact a string and not an error object.

    This may not help someone who has a legitimate Errno 111 error. I knew I was going to run into this error every time I ran the code; I just needed to know how to handle it and hold the loop until I got the response I wanted.

    Edit: I suppose it would be easier to just filter for the word "errno" (case-insensitively, since the response spells it "Errno") and continue the loop, instead of checking for a different substring. That would handle all of my use cases and eliminate the need for a different function.
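
The two checks described in this answer can be combined into one validation step. The following is a rough sketch under assumptions, not the answerer's actual code: `is_ready`, `poll_until_ready`, and the `run_command` callable are hypothetical names, and `run_command` is assumed to run the esxcli command once over SSH and return its output as a string.

```python
import time


def is_ready(resp, required="Domain"):
    """Return True if resp looks like a real answer from the host.

    Rejects empty output and anything mentioning "errno" (compared
    case-insensitively), then requires the marker word that a successful
    "esxcli system hostname get" reply is expected to contain.
    """
    text = resp.strip()
    if not text or "errno" in text.lower():
        return False
    return required in text


def poll_until_ready(run_command, retry_interval, max_attempts):
    """Call run_command() until is_ready() accepts its output.

    Returns the accepted output, or None after max_attempts failures.
    """
    for _ in range(max_attempts):
        resp = run_command()
        if is_ready(resp):
            return resp
        print(f"host not ready, got: {resp.strip()!r}")
        time.sleep(retry_interval)
    return None
```

Keeping the text check in its own function means the same loop can be reused with a different command by passing another `required` marker, which is the simplification the edit above is pointing at.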