I am running the following snippet using GraphX:
val g = Graph(
  v.persist(StorageLevel.MEMORY_AND_DISK_SER),
  e.persist(StorageLevel.MEMORY_AND_DISK_SER),
  1,
  StorageLevel.MEMORY_AND_DISK_SER,
  StorageLevel.MEMORY_AND_DISK_SER)
val pageRankResult = g.pageRank(0.0001)
At runtime I get the following warnings:
[WARN - org.apache.spark.graphx.impl.ShippableVertexPartitionOps] - Diffing two VertexPartitions with different indexes is slow.
and
[WARN - org.apache.spark.graphx.impl.ShippableVertexPartitionOps] - Joining two VertexPartitions with different indexes is slow.
I read the answer to the question "Get Joining two VertexPartitions with different indexes is slow in Spark and GraphX by unpersist graph", but in my case everything is persisted.
What am I doing wrong?
To leverage the fast zip join of VertexPartitions, you must not use a storage level with the _SER suffix.
Switching from the MEMORY_AND_DISK_SER levels to MEMORY_AND_DISK may reduce computation time (and the warnings will no longer appear), but the cached data will take up more space.
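As a minimal sketch of that change, the snippet from the question could be rewritten with a deserialized storage level. The object name PageRankStorage, the method name run, and the Int attribute types for v and e are assumptions made for illustration; the question does not show how v and e are defined.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical wrapper object; only the storage level differs from the question's snippet.
object PageRankStorage {
  // Deserialized level (no _SER suffix), as the answer suggests.
  val level: StorageLevel = StorageLevel.MEMORY_AND_DISK

  // v and e stand for the vertex and edge RDDs from the question.
  // Their Int attribute types are an assumption.
  def run(v: RDD[(VertexId, Int)], e: RDD[Edge[Int]]): Graph[Double, Double] = {
    val g = Graph(
      v.persist(level),
      e.persist(level),
      1,      // default vertex attribute, as in the question
      level,  // edge storage level
      level)  // vertex storage level
    g.pageRank(0.0001)
  }
}
```

Note that Spark does not allow changing the storage level of an RDD that has already been persisted with a different one, so v and e should be unpersisted first or persisted only once, with the new level.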