Tags: apache-spark, h2o

H2O Sparkling Water architecture


I have a Jupyter notebook connected to a Sparkling Water instance running on a Hadoop cluster.

This is my assumption about how the processing works:

  1. The user code from the notebook is submitted to the running Sparkling Water instance.
  2. Sparkling Water translates it into Spark API calls.
  3. It is submitted as a Spark job to the cluster.
  4. Spark executes it like any other job.

Am I right?
Is this how it works?

The bigger question I am trying to answer is whether Sparkling Water runs the H2O algorithms in a distributed manner and utilizes the available cluster resources.
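
For reference, this is roughly what the notebook code looks like (a minimal sketch; depending on the Sparkling Water version, H2OContext.getOrCreate may also take the SparkSession as an argument):

    from pysparkling import H2OContext

    # Attach to (or start) the H2O cluster embedded in the Spark executors.
    hc = H2OContext.getOrCreate()
    print(hc)  # prints the H2O node layout across the executors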


Solution

  • "…whether Sparkling Water runs the H2O algorithms in a distributed manner and utilizes the available cluster resources"

    Yes.

    Sparkling Water embeds H2O nodes within Spark executors. So a Sparkling Water job will train H2O models in the exact same way that core H2O-3 does (with no Spark in the picture).
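
    As a minimal sketch of what that looks like from a notebook (the HDFS path, response column, and model parameters here are illustrative, not part of the question):

        import h2o
        from pysparkling import H2OContext
        from h2o.estimators import H2OGradientBoostingEstimator

        # Start (or attach to) H2O nodes embedded in the Spark executors.
        hc = H2OContext.getOrCreate()

        # The imported frame is partitioned across those H2O nodes, and
        # training runs on all of them, exactly as in core H2O-3.
        frame = h2o.import_file("hdfs:///data/train.csv")  # illustrative path
        model = H2OGradientBoostingEstimator(ntrees=50)
        model.train(y="response", training_frame=frame)    # illustrative column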

    An H2O cluster cannot tolerate nodes joining or leaving once it is running, so you must disable Spark dynamic allocation by setting spark.dynamicAllocation.enabled to false.
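
    For example, if you create the Spark session yourself, the relevant configuration looks roughly like this (the app name and executor count are illustrative):

        from pyspark.sql import SparkSession

        # H2O nodes live inside the executors, so the executor set must stay
        # fixed for the lifetime of the H2O cluster.
        spark = (SparkSession.builder
                 .appName("sparkling-water-notebook")        # illustrative name
                 .config("spark.dynamicAllocation.enabled", "false")
                 .config("spark.executor.instances", "4")    # fixed, illustrative
                 .getOrCreate())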

    There is no performance improvement or reduction from the Spark-ness of Sparkling Water. Rather, it is a friendly way to introduce H2O machine learning models into a Spark environment or pipeline.

    Here is a pointer to the Sparkling Water design documentation, which has a diagram illustrating the above: http://docs.h2o.ai/sparkling-water/2.3/latest-stable/doc/design/design.html