Tags: apache-spark, spark-graphx, apache-zeppelin

Apache Zeppelin not showing Spark output


I am testing Zeppelin with Spark using the following data sample:

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Vertices: (vertexId, (name, age))
val vertexArray = Array(
  (1L, ("Alice", 28)),
  (2L, ("Bob", 27)),
  (3L, ("Charlie", 65)),
  (4L, ("David", 42)),
  (5L, ("Ed", 55)),
  (6L, ("Fran", 50))
)
// Edges: Edge(srcId, dstId, attr)
val edgeArray = Array(
  Edge(2L, 1L, 7),
  Edge(2L, 4L, 2),
  Edge(3L, 2L, 4),
  Edge(3L, 6L, 3),
  Edge(4L, 1L, 1),
  Edge(5L, 2L, 2),
  Edge(5L, 3L, 8),
  Edge(5L, 6L, 3)
)

val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)

I have noticed that Zeppelin is not always able to display the output, even though the same code works fine in the Spark shell. Below is an example; any idea how to fix this?

graph.vertices.filter { case (id, (name, age)) => age > 30 }.foreach {
  case (id, (name, age)) => println(s"$name is $age")
}

Solution

  • There is really nothing to fix here; this is expected behavior. The code inside the foreach closure is executed on the workers, not on the driver where your notebook is running. Its output may be captured depending on your configuration, but it is not something you can depend on.

    If you want to print output from the driver program, the best option is to collect the results (or convert them with toLocalIterator) and iterate locally:

    graph.vertices.filter { case (id, (name, age)) => age > 30 }.collect.foreach {
      case (id, (name, age)) => println(s"$name is $age")
    }
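
    For results too large to fit on the driver at once, the toLocalIterator route mentioned above streams one partition at a time instead of materializing the whole array. A minimal sketch of the same query using it:

    graph.vertices
      .filter { case (id, (name, age)) => age > 30 }
      .toLocalIterator // returns a local Iterator, fetching one partition at a time
      .foreach { case (id, (name, age)) => println(s"$name is $age") }

    The trade-off is that toLocalIterator triggers a separate job per partition, so for a small result like this one, collect is simpler and usually faster.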