I am struggling to understand how I am going to create a graph in GraphX in Apache Spark. I am given the following:
an HDFS file which contains lots of data in the form:
node: ConnectingNode1, ConnectingNode2...
For example:
123214: 521345, 235213, 657323
I need to somehow store this data in an EdgeRDD so that I can create my graph in GraphX, but I have no idea how to go about this.
After you read your HDFS source and have your data in an RDD, you can try something like the following:
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx.Edge
// Sample data
val rdd = sc.parallelize(Seq("1: 1, 2, 3", "2: 2, 3"))
val edges: RDD[Edge[Int]] = rdd.flatMap { row =>
  // split around ":"
  val splitted = row.split(":").map(_.trim)
  // the value to the left of ":" is the source vertex
  val srcVertex = splitted(0).toLong
  // the values to the right of ":" are split around "," to get the other vertices
  val otherVertices = splitted(1).split(",").map(_.trim)
  // for each vertex to the right of ":", create an Edge connecting it to srcVertex
  otherVertices.map(v => Edge(srcVertex, v.toLong, 1))
}
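
In your case, instead of the parallelized sample data, the RDD would come from reading the HDFS file with sc.textFile (the path below is a placeholder, not from the question):

import org.apache.spark.rdd.RDD

// read the file line by line; each line becomes one "node: n1, n2, ..." string
val rdd: RDD[String] = sc.textFile("hdfs:///path/to/your/file")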
Edit
Additionally, if all your vertices can share the same default attribute, you can create your graph straight from the edges, so you don't need to build a separate vertex RDD:
import org.apache.spark.graphx.Graph
val g = Graph.fromEdges(edges, defaultValue = 1)
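Once you have the Graph, a quick sanity check (not part of the original snippet) is to look at the counts and a few triplets:

// vertices are derived from the edge endpoints, each with attribute 1 here
println(g.numVertices)
println(g.numEdges)
// each triplet shows (srcId, srcAttr), (dstId, dstAttr) and the edge attribute
g.triplets.take(5).foreach(println)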