Tags: apache-spark, machine-learning, glm, h2o, sparkling-water

Executor without H2O instance discovered, killing the cloud


I'm running a Tweedie GLM with Sparkling Water on datasets of different sizes: 20 MB, 400 MB, 2 GB, and 25 GB. The code works fine with 10 sampling iterations, but I have to test a larger sampling scenario.

The sampling iteration count is 500.

In this case the code works well for the 20 MB and 400 MB data, but it starts throwing errors once the data is larger than 2 GB.

After some searching I found one suggested solution, disabling the topology change listener, but it did not work for the large data:

--conf "spark.scheduler.minRegisteredResourcesRatio=1" \
--conf "spark.ext.h2o.topology.change.listener.enabled=false"

Here is my spark-submit configuration:

spark-submit \
    --packages ai.h2o:sparkling-water-core_2.10:1.6.1,log4j:log4j:1.2.17 \
    --conf "spark.scheduler.minRegisteredResourcesRatio=1" \
    --conf "spark.ext.h2o.topology.change.listener.enabled=false" \
    --driver-memory 8g \
    --executor-memory 10g \
    --num-executors 10 \
    --executor-cores 5 \
    --class TweedieGLM target/SparklingWaterGLM.jar \
    "$1" \
    "$2"

This is the error I got:

    16/07/08 20:39:55 ERROR YarnScheduler: Lost executor 2 on cfclbv0152.us2.oraclecloud.com: Executor heartbeat timed out after 175455 ms
    16/07/08 20:40:00 ERROR YarnScheduler: Lost executor 2 on cfclbv0152.us2.oraclecloud.com: remote Rpc client disassociated
    16/07/08 20:40:00 ERROR LiveListenerBus: Listener anon1 threw an exception
    java.lang.IllegalArgumentException: Executor without H2O instance discovered, killing the cloud!
            at org.apache.spark.h2o.H2OContext$$anon$1.onExecutorAdded(H2OContext.scala:203)
            at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:58)
            at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
            at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
            at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:56)
            at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
            at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
            at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
            at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)

Solution

  • After carefully reading the issue posted on GitHub (https://github.com/h2oai/sparkling-water/issues/32), I tried a couple of options. Here is what I tried:

    Added:

    --conf "spark.scheduler.minRegisteredResourcesRatio=1" \
    --conf "spark.ext.h2o.topology.change.listener.enabled=false" \
    --conf "spark.locality.wait=3000" \
    --conf "spark.ext.h2o.network.mask=10.196.64.0/24"

    Changed:

    executors from 10 to 3, 6, and 9
    executor-memory from 4 GB to 12 GB, and from 12 GB to 24 GB
    driver-memory from 4 GB to 12 GB, and from 12 GB to 24 GB
    
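    A detail worth calling out (my note, not part of the original answer): each Spark property needs its own --conf flag, and all --conf flags must appear before the application JAR, otherwise spark-submit treats them as application arguments and silently ignores them. A corrected invocation combining the options above might look like:

    ```shell
    spark-submit \
        --packages ai.h2o:sparkling-water-core_2.10:1.6.1,log4j:log4j:1.2.17 \
        --conf "spark.scheduler.minRegisteredResourcesRatio=1" \
        --conf "spark.ext.h2o.topology.change.listener.enabled=false" \
        --conf "spark.locality.wait=3000" \
        --conf "spark.ext.h2o.network.mask=10.196.64.0/24" \
        --driver-memory 24g \
        --executor-memory 24g \
        --num-executors 6 \
        --executor-cores 5 \
        --class TweedieGLM target/SparklingWaterGLM.jar "$1" "$2"
    ```

    The memory and executor counts here are just one of the combinations the answer mentions trying; tune them to your cluster.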

    This is what I learned: GLM is a memory-intensive job, so you have to provide the executors with sufficient memory for the job to complete.
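One practical detail behind that lesson (an aside of mine, not from the original answer): on YARN, each executor requests more memory than --executor-memory, because Spark adds spark.yarn.executor.memoryOverhead on top; in Spark 1.6 the default overhead is the larger of 384 MB or 10% of executor memory. A quick sketch of the container size actually requested (function name is mine, for illustration):

```python
def yarn_container_mb(executor_memory_mb: int) -> int:
    """Approximate YARN container size for one Spark executor.

    Spark 1.6 on YARN adds spark.yarn.executor.memoryOverhead on top of
    --executor-memory; the default overhead is max(384 MB, 10%).
    """
    overhead_mb = max(384, int(executor_memory_mb * 0.10))
    return executor_memory_mb + overhead_mb

# --executor-memory 10g actually requests about an 11 GB container
print(yarn_container_mb(10 * 1024))  # -> 11264
```

If the containers YARN grants are smaller than this, executors get killed under memory pressure, which shows up exactly as the lost-executor and "Executor without H2O instance" errors above.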