
Spark's GraphX: Why am I getting "Joining / Diffing two VertexPartitions with different indexes is slow" even after persisting the involved RDDs?


I am running the following snippet using GraphX:

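import org.apache.spark.graphx.Graph
import org.apache.spark.storage.StorageLevel

// v: vertex RDD, e: edge RDD (defined elsewhere)
val g = Graph(
  v.persist(StorageLevel.MEMORY_AND_DISK_SER),
  e.persist(StorageLevel.MEMORY_AND_DISK_SER),
  1,                                 // default vertex attribute
  StorageLevel.MEMORY_AND_DISK_SER,  // edge storage level
  StorageLevel.MEMORY_AND_DISK_SER)  // vertex storage level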

val pageRankResult = g.pageRank(0.0001)

I am warned at runtime with:

  • [WARN - org.apache.spark.graphx.impl.ShippableVertexPartitionOps] - Diffing two VertexPartitions with different indexes is slow.

and

  • [WARN - org.apache.spark.graphx.impl.ShippableVertexPartitionOps] - Joining two VertexPartitions with different indexes is slow.

I have read the answer to the question "Get Joining two VertexPartitions with different indexes is slow in Spark and GraphX by unpersist graph", but in my case everything is persisted.

What am I doing wrong?


Solution

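  • To benefit from the fast zip join of VertexPartitions, do not use a storage level with the _SER suffix.

    The fast path requires both partitions to share the same index object. With a serialized storage level, each read of a cached partition deserializes a fresh copy of the index, so the check fails and GraphX falls back to the slow join and diff.

    Switching from MEMORY_AND_DISK_SER to MEMORY_AND_DISK should reduce computation time (and the warnings will disappear), but the cached data will take more space.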
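
    As a minimal sketch of that change, the graph from the question could be built with the deserialized level throughout. The helper name buildDeserializedGraph and the Int vertex attribute type are assumptions made for illustration (the question does not show how v and e are defined); only the storage levels differ from the original snippet.

```scala
import scala.reflect.ClassTag

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper: `v` and `e` stand in for the vertex and edge RDDs
// from the question. The Int vertex attribute is an assumption based on
// the default vertex attribute `1` used there.
def buildDeserializedGraph[ED: ClassTag](
    v: RDD[(VertexId, Int)],
    e: RDD[Edge[ED]]): Graph[Int, ED] = {
  // Deserialized level: cached partitions are kept as Java objects.
  val level = StorageLevel.MEMORY_AND_DISK

  // Note: persist() throws if the RDD was already persisted at a different
  // level, so `v` and `e` must not have been cached with a _SER level before.
  Graph(
    v.persist(level),
    e.persist(level),
    1,      // default vertex attribute
    level,  // edge storage level
    level)  // vertex storage level
}
```

    With v and e in scope, the call from the question then becomes buildDeserializedGraph(v, e).pageRank(0.0001).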