
How do I create a Graph in GraphX with this data?


I am struggling to understand how to create the following graph in GraphX on Apache Spark. I am given the following:

An HDFS file containing a large amount of data, where each line has the form:

node: ConnectingNode1, ConnectingNode2..

For example:

123214: 521345, 235213, 657323

I need to somehow store this data in an EdgeRDD so that I can create my graph in GraphX, but I have no idea how to go about this.


Solution

  • After you read your HDFS source into an RDD of lines, you can try something like the following:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.graphx.Edge
    // Sample data standing in for the lines read from HDFS
    val rdd = sc.parallelize(Seq("1: 1, 2, 3", "2: 2, 3"))

    val edges: RDD[Edge[Int]] = rdd.flatMap { row =>
      // split around the first ":"; the limit of 2 keeps an empty right-hand side for lines like "5:"
      val parts = row.split(":", 2).map(_.trim)
      // the value to the left of ":" is the source vertex
      val srcVertex = parts(0).toLong
      // the values to the right of ":" are split around "," to get the other vertices; empty entries are dropped
      val otherVertices = parts(1).split(",").map(_.trim).filter(_.nonEmpty)
      // for each vertex to the right of ":", create an Edge from srcVertex to it, with edge attribute 1
      otherVertices.map(v => Edge(srcVertex, v.toLong, 1))
    }
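
    The parsing step can also be pulled out into a plain function, which makes it easy to check without a cluster. This is only a sketch: `AdjacencyParser` and `parseAdjacencyLine` are hypothetical names, not part of GraphX, and silently skipping malformed lines or IDs is an assumption about how you want bad input handled.

    ```scala
    import scala.util.Try

    object AdjacencyParser {
      // Hypothetical helper (not a GraphX API): turns "src: dst1, dst2" into (src, dst) pairs.
      // Assumption: lines or IDs that cannot be parsed are skipped instead of failing the job.
      def parseAdjacencyLine(line: String): Seq[(Long, Long)] =
        line.split(":", 2).map(_.trim) match {
          case Array(src, rest) =>
            // A non-numeric source ID yields no pairs at all.
            Try(src.toLong).toOption.toList.flatMap { s =>
              rest.split(",").toList
                .map(_.trim)
                .flatMap(d => Try(d.toLong).toOption) // drops empty or non-numeric IDs
                .map(d => (s, d))
            }
          case _ => Seq.empty // no ":" on the line
        }

      def main(args: Array[String]): Unit = {
        println(parseAdjacencyLine("123214: 521345, 235213, 657323"))
        println(parseAdjacencyLine("42:"))
      }
    }
    ```

    With Spark, this would slot in as `rdd.flatMap(AdjacencyParser.parseAdjacencyLine).map { case (src, dst) => Edge(src, dst, 1) }`.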
    

    Edit

    Additionally, if all your vertices can share a constant default attribute, you can create the graph straight from the edges, so you don't need to build a separate vertex RDD:

    import org.apache.spark.graphx.Graph
    val g = Graph.fromEdges(edges, defaultValue = 1)
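
    To sanity-check the result, a small end-to-end sketch along these lines may help. It assumes Spark and GraphX are on the classpath and runs in local mode; `AdjacencyGraph` and `fromLines` are illustrative names, and the in-memory sample stands in for the HDFS file.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{Edge, Graph}
    import org.apache.spark.rdd.RDD

    object AdjacencyGraph {
      // Builds a graph from "src: dst1, dst2" lines; every vertex and edge gets the attribute 1.
      def fromLines(lines: RDD[String]): Graph[Int, Int] = {
        val edges: RDD[Edge[Int]] = lines.flatMap { row =>
          val parts = row.split(":", 2).map(_.trim)
          // Lines without ":" or without neighbours produce no edges.
          val neighbours =
            if (parts.length == 2) parts(1).split(",").map(_.trim).filter(_.nonEmpty)
            else Array.empty[String]
          neighbours.map(v => Edge(parts(0).toLong, v.toLong, 1))
        }
        // Vertices are derived from the edge endpoints and all receive the default value.
        Graph.fromEdges(edges, defaultValue = 1)
      }

      def main(args: Array[String]): Unit = {
        // Assumption: local mode for experimentation; on a cluster the master comes from spark-submit.
        val sc = new SparkContext(new SparkConf().setAppName("adjacency-graph").setMaster("local[*]"))
        try {
          // With real data this would be sc.textFile(path), where path is your HDFS location.
          val lines = sc.parallelize(Seq("1: 1, 2, 3", "2: 2, 3"))
          val g = fromLines(lines)
          println(s"vertices = ${g.numVertices}, edges = ${g.numEdges}")
          // Map to plain tuples before collecting, then print each edge as "src -> dst".
          g.edges.map(e => (e.srcId, e.dstId)).collect().foreach { case (s, d) => println(s"$s -> $d") }
        } finally sc.stop()
      }
    }
    ```

    Note that edges are directed and self-loops such as `1 -> 1` are kept as they appear in the input; if you need an undirected view, you would have to decide separately how to treat each pair.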