Understanding Sparkling Water


I am new to Sparkling Water and have a few quick questions:

  1. Does Sparkling Water support all the algorithms that Spark MLlib and H2O provide?

  2. Does Sparkling Water itself provide algorithms that Spark MLlib and H2O don't support?

  3. If I want to write code with pure Spark MLlib in a Sparkling Water context, do I have to use H2OContext or any Sparkling Water-related API?

Behind these three questions, what I really want to understand is how Sparkling Water works. (At present, I know no more than that Sparkling Water brings Spark and H2O together.)

Thanks.

Follow-up questions, 2017-01-11

I am able to run the AirlinesWithWeatherDemo2 example with run-example.sh successfully, but I have two questions:

  1. The H2O Flow web UI is available while the application is running (it can be accessed on port 54321), but when the application finishes, the process that opened port 54321 also shuts down and the web UI is no longer accessible. When I am running the example, what functionality does the Flow UI provide, given that it may be so short-lived?

  2. Sparkling Water is meant to integrate Spark and H2O. When I submit the example, I only need sparkling-water-assembly_2.11-2.0.3-all as the application jar (it contains the example classes). Does that mean that if I want to run H2O algorithms that Sparkling Water doesn't provide, I should add the H2O jar (h2o.jar) as a dependency?


Solution

    1. Yes

    2. Not really. We are, however, working on wrapping Spark's MLlib algorithms so you can run them from H2O's Flow UI, and on wrapping H2O's algorithms so you can use them in MLlib pipelines.

    3. You need H2OContext only if you want to run H2O-specific functionality.

    Sparkling Water simply allows you to run H2O nodes inside Spark nodes, instead of bootstrapping the H2O cluster by hand. It also lets you work with the same data from both H2O and Spark.
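
    To illustrate answer 3, here is a minimal Scala sketch, assuming Spark 2.0.x and Sparkling Water 2.0.x on the classpath (for example via the assembly jar mentioned in the question). The object name, column names, and toy data are invented for illustration. The MLlib part never touches H2OContext; the context is created only at the point where the data should become visible to H2O.

```scala
import org.apache.spark.h2o.H2OContext
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of the Sparkling Water examples.
object MLlibWithOptionalH2O {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mllib-with-optional-h2o").getOrCreate()
    import spark.implicits._

    // Toy data, invented for this sketch.
    val df = Seq((1.0, 2.0, 3.1), (2.0, 1.0, 3.9), (3.0, 4.0, 7.2), (4.0, 3.0, 6.8))
      .toDF("x1", "x2", "label")

    // Pure Spark MLlib: no H2OContext and no Sparkling Water API involved.
    val assembled = new VectorAssembler()
      .setInputCols(Array("x1", "x2"))
      .setOutputCol("features")
      .transform(df)
    val model = new LinearRegression().fit(assembled)
    println(s"MLlib coefficients: ${model.coefficients}")

    // Only from here on is H2O needed: this starts H2O nodes inside the
    // Spark executors and publishes the DataFrame as an H2O frame.
    val hc = H2OContext.getOrCreate(spark.sparkContext)
    val h2oFrame = hc.asH2OFrame(df)
    println(s"H2O frame: ${h2oFrame.numRows()} rows, ${h2oFrame.numCols()} columns")

    // Shut down H2O together with the SparkContext.
    hc.stop(true)
  }
}
```

    If the last block is dropped, the job is an ordinary Spark application and could run without Sparkling Water at all.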

    @Edit:

    1. None, really. However, you might have a long-running Spark job that doesn't exit after the initial computation but blocks instead (and then has to be killed somehow). In that case you can use the Flow UI as normal. We simply start the HTTP server every time, even for demos; there is no reason not to.

    2. You can either use one of our droplets, https://github.com/h2oai/h2o-droplets/tree/master/sparkling-water-droplet, which is a template project: add your logic to the main class, run ./gradlew shadowJar, and submit the resulting jar with spark-submit; it already contains all the jars. Or, as you mentioned, you'll need to provide all the necessary dependencies (through --jars or --packages), H2O.jar included.
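
    The long-running job from point 1 of the edit could look roughly like the sketch below. It is one possible way to "lock" the driver, not something the examples ship with; the object name is made up, and blocking on a latch is an assumption of this sketch. As long as the driver stays alive, the H2O nodes and the Flow UI (port 54321 in the question) stay up.

```scala
import java.util.concurrent.CountDownLatch

import org.apache.spark.h2o.H2OContext
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of the Sparkling Water examples.
object KeepFlowAlive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("keep-flow-alive").getOrCreate()

    // Starts H2O inside the Spark executors; the Flow HTTP server comes up with it.
    val hc = H2OContext.getOrCreate(spark.sparkContext)

    // ... the initial computation would go here ...

    println("H2O is running; Flow stays reachable until this job is killed.")

    // Block the driver's main thread indefinitely: the latch is never counted
    // down, so the application only ends when it is killed from outside.
    new CountDownLatch(1).await()
  }
}
```

    Such a job never finishes on its own, so it has to be stopped explicitly, for example by killing the driver process.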