Tags: hadoop, apache-spark, spark-graphx

Storing GraphX vertices on HDFS and loading them later


I create an RDD:

val verticesRDD: RDD[(VertexId, Long)] = vertices
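
(For context: VertexId comes from GraphX and is just an alias for Long. The imports below make the annotation compile; the construction of vertices is a hypothetical stand-in for however the real upstream RDD was built.)

import org.apache.spark.graphx.VertexId  // type VertexId = Long
import org.apache.spark.rdd.RDD

// Hypothetical stand-in for the real upstream `vertices` RDD
val vertices: RDD[(VertexId, Long)] =
  sc.parallelize(Seq(
    (4000000031043205L, 1L),
    (4000000031043206L, 2L),
    (4000000031043207L, 3L)
  ))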

I can inspect it and everything looks ok:

verticesRDD.take(3).foreach(println)
(4000000031043205,1)
(4000000031043206,2)
(4000000031043207,3)

I save this RDD to HDFS via:

verticesRDD.saveAsObjectFile("location/vertices")
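
(Worth noting: saveAsObjectFile writes a directory of part files rather than a single file. A quick way to see what actually landed on HDFS, assuming the same SparkContext sc, is to list the directory through Hadoop's FileSystem API:)

import org.apache.hadoop.fs.{FileSystem, Path}

// List the part files that saveAsObjectFile produced
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.listStatus(new Path("location/vertices")).foreach(s => println(s.getPath))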

I then try to read this file back to make sure it worked:

val verticesRDD_check = sc.textFile("location/vertices")

This runs fine; however, when I try to inspect the contents, something is clearly wrong.

verticesRDD_check.take(2).foreach(println)
    SEQ!org.apache.hadoop.io.NullWritable"org.apache.hadoop.io.BytesWritablea��:Y4o�e���v������ur[Lscala.Tuple2;.���O��xp
srscala.Tuple2$mcJJ$spC�~��f��J _1$mcJ$spJ  _2$mcJ$spxr
                                                           scala.Tuple2�}��F!�L_1tLjava/lang/Object;L_2q~xppp5���sq~pp5���sq~pp5���sq~pp5���sq~pp5���esq~pp5���hsq~pp5��୑sq~pp5���sq~pp5���q    sq~pp5��ஓ

Is the issue in how I save the RDD with saveAsObjectFile, or in how I read it back with textFile?


Solution

  • saveAsObjectFile stores the RDD as a Hadoop SequenceFile of Java-serialized objects (hence the SEQ!org.apache.hadoop.io.NullWritable header in your dump), so sc.textFile just shows you the raw bytes. Read it back with sc.objectFile instead, and specify the element type, since objectFile is generic:

    val verticesRDD: RDD[(VertexId, Long)] = sc.objectFile("location/vertices")
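
    A full round trip to confirm (a minimal sketch, assuming the same SparkContext sc and the "location/vertices" path from the question):

    import org.apache.spark.graphx.VertexId
    import org.apache.spark.rdd.RDD

    // Write: Java-serialized objects in a Hadoop SequenceFile
    verticesRDD.saveAsObjectFile("location/vertices")

    // Read: objectFile deserializes; the element type must be supplied
    // because objectFile is generic
    val verticesRDD_check: RDD[(VertexId, Long)] =
      sc.objectFile[(VertexId, Long)]("location/vertices")

    // Should print the original tuples, e.g. (4000000031043205,1)
    verticesRDD_check.take(3).foreach(println)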