python, h2o, h2o.ai

How to prevent h2o cluster shutdown without notice using Python


My code loads an H2O MOJO model to get predictions on a small dataset. However, H2O shuts down abruptly by itself. The same code works fine on one machine, but with the same inputs I see abnormal H2O shutdowns on another machine.

self.test = h2o.import_file(dataset_file)
preds = imported_model.predict(self.test)

I am running this on a 1 TB machine with 72 cores, so I find it hard to believe this is a memory issue. The most puzzling fact is that the same code works on another machine (configured differently) with the same inputs. I don't know the full list of differences between the two. Previously I was running a frozen Python binary build and couldn't see the error messages in detail. I am now running the Python code directly and can see more of the error:

 File "h2o_model_eval.py", line 160, in ModelEval
    preds = imported_model.predict(self.test)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/h2o/model/model_base.py", line 334, in predict
    j.poll()
  File ".venv/lib/python3.11/site-packages/h2o/job.py", line 71, in poll
    pb.execute(self._refresh_job_status)
  File ".venv/lib/python3.11/site-packages/h2o/utils/progressbar.py", line 187, in execute
    res = progress_fn()  # may raise StopIteration
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/h2o/job.py", line 136, in _refresh_job_status
    jobs = self._query_job_status_safe()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/h2o/job.py", line 132, in _query_job_status_safe
    raise last_err
  File ".venv/lib/python3.11/site-packages/h2o/job.py", line 114, in _query_job_status_safe
    result = h2o.api("GET /3/Jobs/%s" % self.job_key)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/h2o/h2o.py", line 123, in api
    return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/h2o/backend/connection.py", line 507, in request
    raise H2OConnectionError("Unexpected HTTP error: %s" % e)
h2o.exceptions.H2OConnectionError: Unexpected HTTP error: HTTPConnectionPool(host='localhost', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_acab67512114e05db6ec9865ea9849d3 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x2aab3f59ea90>: Failed to establish a new connection: [Errno 111] Connection refused'))

How can I debug this issue?


Solution

  • Without more logs, I cannot tell what the problem is. However, I do frequently run into the following problem:

    I start an H2O-3 cluster and begin working. Then I start another H2O-3 cluster. The first cluster ends up shutting down because the two clusters try to form a single H2O-3 cloud (they have the same default name) and their H2O-3 versions are not quite the same, or a hash or something similar does not match.

    The way around this problem is to start each of your H2O-3 clusters with a different name, like this:

    java -jar h2o.jar -name "cluster007"

    I hope this resolves your issue. If not, please provide more logs or code to reproduce the error.
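The renaming advice above can also be scripted. The following is only a minimal sketch using the Python standard library: it builds a `java -jar h2o.jar -name ...` command with a name that is unlikely to collide, and checks whether anything is still accepting connections on the port from the traceback (`localhost:54321`). The helper names (`unique_cluster_name`, `build_h2o_command`, `is_listening`) are hypothetical, and the `-port` flag is an addition that is not part of the original answer.

```python
import socket
import uuid


def unique_cluster_name(prefix="cluster"):
    """Return a cluster name that is unlikely to collide with another one."""
    return f"{prefix}_{uuid.uuid4().hex[:8]}"


def build_h2o_command(jar_path, name, port=54321):
    """Build the argument list for launching a standalone H2O-3 node.

    -name comes from the answer above; -port is an assumption added here
    so that two clusters on one host do not compete for the same port.
    """
    return ["java", "-jar", jar_path, "-name", name, "-port", str(port)]


def is_listening(host="localhost", port=54321, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused" (as in the traceback) and timeouts.
        return False


if __name__ == "__main__":
    name = unique_cluster_name()
    # Print the command instead of running it; pass the list to
    # subprocess.Popen if you want to launch the node from Python.
    print(" ".join(build_h2o_command("h2o.jar", name)))
    print("H2O reachable:", is_listening())
```

Calling `is_listening()` just before and just after `predict` may help narrow down when the server process goes away. If you start H2O from the Python client instead of the command line, `h2o.init` also accepts a `name` argument that can serve the same purpose.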