h2o

What is the difference between H2O standalone and H2O on Hadoop?


The H2O docs say:

Inside H2O, a Distributed Key/Value store is used to access and reference data, models, objects, etc., across all nodes and machines. The algorithms are implemented on top of H2O’s distributed Map/Reduce framework and utilize the Java Fork/Join framework for multi-threading. The data is read in parallel and is distributed across the cluster...

The H2O downloads page also offers a standalone version. What are the differences between the standalone and Hadoop versions? For example, I assume the H2O algorithms are designed around MapReduce, so would ML training on H2OFrame objects be slower in standalone mode, even if the single host had the same memory as would be allocated to a YARN application?


Solution

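  • The main differences are how the jobs are started and whether they have convenient access to HDFS.

    Model training behaves the same in both modes if you give the cluster the same number of nodes and the same memory and CPU per node. The Map/Reduce framework mentioned in the docs is H2O's own, so it is used in standalone mode as well.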
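
To illustrate the "how the jobs start" part, here is a minimal sketch that only builds the two launch command lines without running them. The helper names `standalone_command` and `hadoop_command` are hypothetical. The jar names, memory size, and output directory are placeholder values that would need to match your own download and cluster.

```python
# Sketch: build (but do not execute) the launch commands for each mode.
# Helper names and default values are illustrative assumptions.

def standalone_command(memory="6g", jar="h2o.jar"):
    """Standalone: a plain JVM process started directly on the host."""
    return ["java", f"-Xmx{memory}", "-jar", jar]


def hadoop_command(nodes=1, memory="6g", jar="h2odriver.jar",
                   output_dir="hdfsOutputDirName"):
    """Hadoop: the driver jar asks YARN to start the H2O nodes as mappers."""
    return [
        "hadoop", "jar", jar,
        "-nodes", str(nodes),      # number of H2O nodes to request
        "-mapperXmx", memory,      # heap size per H2O node
        "-output", output_dir,     # HDFS directory used by the driver
    ]


if __name__ == "__main__":
    print(" ".join(standalone_command()))
    print(" ".join(hadoop_command()))
```

In both cases the node count and per-node heap are the settings that matter for training. Once the cluster is up, a client such as the Python package should be able to attach to it the same way (for example with `h2o.init(ip=..., port=...)`), regardless of how it was launched.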