I am struggling to understand how I am going to create a graph in GraphX in Apache Spark. I am given the following:
an HDFS file which contains lots of data in the form:
node: ConnectingNode1, ConnectingNode2...
For example:
123214: 521345, 235213, 657323
I need to somehow store this data in an EdgeRDD so that I can create my graph in GraphX, but I have no idea how to go about this.
After you read your HDFS source and have your data in an RDD, you can try something like the following:
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx.Edge
// Sample data
val rdd = sc.parallelize(Seq("1: 1, 2, 3", "2: 2, 3"))
val edges: RDD[Edge[Int]] = rdd.flatMap { row =>
  // split around ":"
  val splitted = row.split(":").map(_.trim)
  // the value to the left of ":" is the source vertex
  val srcVertex = splitted(0).toLong
  // the values to the right of ":" are split around "," to get the other vertices
  val otherVertices = splitted(1).split(",").map(_.trim)
  // for each vertex to the right of ":", create an Edge connecting it to srcVertex
  otherVertices.map(v => Edge(srcVertex, v.toLong, 1))
}
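
In your case, instead of the parallelized sample data, the RDD would come from reading the HDFS file with sc.textFile (the path below is a placeholder, not from the question):

import org.apache.spark.rdd.RDD

// read the file line by line; each line becomes one "node: n1, n2, ..." string
val rdd: RDD[String] = sc.textFile("hdfs:///path/to/your/file")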
Edit
Additionally, if all your vertices can share the same default attribute, you can create your graph straight from the edges, so you don't need to build a separate vertex RDD:
import org.apache.spark.graphx.Graph
val g = Graph.fromEdges(edges, defaultValue = 1)
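Once you have the Graph, a quick sanity check (not part of the original snippet) is to look at the counts and a few triplets:

// vertices are derived from the edge endpoints, each with attribute 1 here
println(g.numVertices)
println(g.numEdges)
// each triplet shows (srcId, srcAttr), (dstId, dstAttr) and the edge attribute
g.triplets.take(5).foreach(println)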