I have a Jupyter notebook connected to a Sparkling Water instance, running on a Hadoop cluster.
This is my assumption about how the processing works:
Is my understanding correct? Is this how the processing actually works?
The bigger question I am trying to answer is whether Sparkling Water runs the H2O algorithms in a distributed manner and utilizes the available cluster resources.
> is whether Sparkling Water runs the H2O algorithms in a distributed manner and utilizes the available cluster resources
Yes.
Sparkling Water embeds H2O nodes within the Spark executors, so a Sparkling Water job trains H2O models in exactly the same way that core H2O-3 does when no Spark is in the picture: the work is distributed across the H2O nodes, and therefore across the cluster.
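For concreteness, here is a minimal PySparkling sketch of that flow. It assumes a notebook with an existing `SparkSession` named `spark`, a hypothetical input file `data.csv` with a `label` column, and the internal backend; the exact `H2OContext` and frame-conversion method names vary slightly between Sparkling Water versions.

```python
from pysparkling import H2OContext
from h2o.estimators import H2OGradientBoostingEstimator

# Start one H2O node inside each Spark executor (internal backend).
# Older Sparkling Water versions take the session: H2OContext.getOrCreate(spark)
hc = H2OContext.getOrCreate()

# Load data with Spark, then convert it to a distributed H2OFrame;
# each H2O node holds the rows that live on its executor.
spark_df = spark.read.csv("data.csv", header=True, inferSchema=True)
h2o_frame = hc.asH2OFrame(spark_df)  # as_h2o_frame(...) in older PySparkling

# Training now runs on the H2O cluster exactly as in standalone H2O-3.
model = H2OGradientBoostingEstimator(ntrees=50)
model.train(y="label", training_frame=h2o_frame)
```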
An H2O cluster does not tolerate nodes joining or leaving once it is running, so you must disable Spark dynamic allocation by setting `spark.dynamicAllocation.enabled` to `false`.
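As a sketch, this can be done when the Spark session is built. The configuration keys below are standard Spark settings; the app name and the executor count of 4 are just illustrative values.

```python
from pyspark.sql import SparkSession

# Pin the executor set for the lifetime of the H2O cluster:
# a fixed number of executors, with dynamic allocation disabled.
spark = (
    SparkSession.builder
    .appName("sparkling-water-notebook")               # hypothetical app name
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.executor.instances", "4")           # illustrative size
    .getOrCreate()
)
```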
There is no performance improvement or reduction from the Spark-ness of Sparkling Water. Rather, it is a friendly way to introduce H2O machine learning models into a Spark environment or pipeline.
Here is a pointer to the Sparkling Water design documentation, which has a picture illustrating the above: http://docs.h2o.ai/sparkling-water/2.3/latest-stable/doc/design/design.html