Tags: r, cloudera-cdh, h2o, sparklyr, sparkling-water

Continuous "Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused" in RSparkling on CDH-5.10.2


I'm trying to execute this RSparkling example on an offline CDH-5.10.2 cluster. My environment is:

  • Spark 1.6.0;
  • sparklyr 0.6.2;
  • h2o 3.10.5.2;
  • rsparkling 0.2.1.

I use a custom Sparkling Water JAR, which is basically 1.6.12 with this PR applied:

options(rsparkling.sparklingwater.location = "/opt/h2o/sparkling-water-1.6.13-SNAPSHOT/assembly/build/libs/sparkling-water-assembly_2.10-1.6.13-SNAPSHOT-all.jar")

After a successful connection:

config <- spark_config()
config$spark.dynamicAllocation.enabled <- "false"
config$spark.driver.memory <- "6g"
config$spark.executor.memory <- "6g"
config$spark.executor.heartbeatInterval <- "20s"

sc <- spark_connect(master = "yarn-client", config = config)

I create the H2O context:

h2o_context(sc)

Creating the H2O context takes a few minutes (this is the first strange thing).

After creation, the application becomes unresponsive for another few minutes (even the Spark master UI becomes unreachable). No H2O logs are printed during this time.

After that, H2O logs appear, but they consist mostly of these messages:

Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused

and, rarely, these in between:

WARN: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:Zero   + POJO:661.8 MB + FREE:306.7 MB == MEM_MAX:968.5 MB), desiredKV=121.1 MB OOM!

Then the following code, which is unrelated to H2O, executes quickly:

flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
airports_tbl <- copy_to(sc, nycflights13::airports, "airports")
airlines_tbl <- copy_to(sc, nycflights13::airlines, "airlines")
model_tbl <- flights_tbl %>%
  filter(!is.na(arr_delay) & !is.na(dep_delay) & !is.na(distance)) %>%
  filter(dep_delay > 15 & dep_delay < 240) %>%
  filter(arr_delay > -60 & arr_delay < 360) %>%
  left_join(airlines_tbl, by = c("carrier" = "carrier")) %>%
  mutate(gain = dep_delay - arr_delay) %>%
  select(origin, dest, carrier, airline = name, distance, dep_delay, arr_delay, gain)

But when H2O must come into play again:

df_hex <- as_h2o_frame(sc, model_tbl, name = "model_hex", FALSE)

the application hangs again (so far, it has been hanging for twenty minutes or so).

I reran this code multiple times and it succeeded once, but normally it just hangs. How can I troubleshoot this?

I checked CPU, RAM, and disk usage; all of these seem to be OK. There are no evident network problems either.

Update 1. Maybe the ConnectException is just a consequence of K/V:Zero + POJO:661.8 MB + FREE:306.7 MB == MEM_MAX:968.5 MB. So I will try to find out how to increase H2O's maximum memory (and why it's below 1 GB in the first place).
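
One way to test the hypothesis in Update 1 is to ask the JVM behind the sparklyr connection for its maximum heap size and compare it with MEM_MAX from the H2O log. This is a minimal sketch, assuming an open connection `sc` created as shown earlier and assuming the H2O client runs in that same JVM; `invoke_static` and `invoke` are sparklyr's generic helpers for calling Java methods, and the variable names are illustrative only.

```r
library(sparklyr)

# Ask the JVM behind the sparklyr connection for its maximum heap size
# (java.lang.Runtime.getRuntime().maxMemory(), in bytes).
runtime <- invoke_static(sc, "java.lang.Runtime", "getRuntime")
max_heap_bytes <- invoke(runtime, "maxMemory")

# Convert to MB so the value can be compared with MEM_MAX in the H2O log.
max_heap_mb <- max_heap_bytes / 1024^2
print(max_heap_mb)
```

If the printed value is close to the MEM_MAX figure in the log rather than to the 6g requested through `spark.driver.memory`, that would suggest the driver memory setting is not reaching the JVM.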


Solution

  • The root cause was insufficient memory allocated to the sparklyr JVM: the default 1 GB was not enough for the H2O client, which runs in the same JVM. These lines of code saved the day:

    config$`sparklyr.shell.driver-memory` <- "6g"
    config$`sparklyr.shell.executor-memory` <- "6g"
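
For context, here is how the fix might fit into the connection code from the question. This is a sketch under the assumption that the two `sparklyr.shell.*` lines are simply added to the existing configuration; every other setting is copied unchanged from the question. The likely reason this works where `spark.driver.memory` did not is that `sparklyr.shell.*` entries are handed to `spark-submit` as command-line options, so they take effect before the driver JVM starts, whereas in client mode a driver memory setting applied after the JVM is already running cannot change its heap size.

```r
library(sparklyr)
library(rsparkling)

# Custom Sparkling Water build, as in the question.
options(rsparkling.sparklingwater.location = "/opt/h2o/sparkling-water-1.6.13-SNAPSHOT/assembly/build/libs/sparkling-water-assembly_2.10-1.6.13-SNAPSHOT-all.jar")

config <- spark_config()
config$spark.dynamicAllocation.enabled <- "false"
config$spark.driver.memory <- "6g"
config$spark.executor.memory <- "6g"
config$spark.executor.heartbeatInterval <- "20s"

# The fix: pass the memory sizes to spark-submit itself, so the JVM that
# hosts both sparklyr and the H2O client starts with a 6 GB heap.
config$`sparklyr.shell.driver-memory` <- "6g"
config$`sparklyr.shell.executor-memory` <- "6g"

sc <- spark_connect(master = "yarn-client", config = config)

# Create the H2O context on top of the Spark connection.
h2o_context(sc)
```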