How to create a VertexId in Apache Spark GraphX using a Long data type?

I'm trying to create a Graph using some Google Web Graph data which can be found here:

import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val textFile = sc.textFile("hdfs://")
val arrayForm = textFile.filter(_.charAt(0)!='#').map(_.split("\\s+")).cache()
val nodes = arrayForm.flatMap(array => array).distinct().map(_.toLong)
val edges = arrayForm.map(line => Edge(line(0).toLong,line(1).toLong))

val graph = Graph(nodes,edges)

Unfortunately, I get this error:

<console>:27: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[Long]
 required: org.apache.spark.rdd.RDD[(org.apache.spark.graphx.VertexId, ?)]
Error occurred in an application involving default arguments.
       val graph = Graph(nodes,edges)

So how can I create a VertexId object? As I understand it, passing a Long should be sufficient.

Any ideas?

Thanks a lot!



  • Not exactly. If you take a look at the signature of the apply method of the Graph object you'll see something like this (for a full signature see API docs):

    apply[VD, ED](
        vertices: RDD[(VertexId, VD)], edges: RDD[Edge[ED]], defaultVertexAttr: VD)

    As you can read in a description:

    Construct a graph from a collection of vertices and edges with attributes.

    Because of that, you cannot simply pass an RDD[Long] as the vertices argument; each vertex has to be a (VertexId, attribute) pair. An RDD[Edge[Nothing]] as edges won't work either, so the edges need an attribute as well:

    import scala.{Option, None}
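    val nodes: RDD[(VertexId, Option[String])] = arrayForm.
        flatMap(array => array).
        // An explicit parameter is used here: in (_.toLong, None) the
        // placeholder would bind inside the tuple element and not compile.
        map(id => (id.toLong, None))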
    val edges: RDD[Edge[String]] = arrayForm.
        map(line => Edge(line(0).toLong, line(1).toLong, ""))

    Note that:

    Duplicate vertices are picked arbitrarily

    so calling .distinct() on nodes is unnecessary in this case.
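
    Putting the pieces together, a minimal end-to-end sketch might look like the following. It assumes a local SparkContext and a few in-memory lines standing in for the HDFS file; the object name VertexIdSketch, the build helper and the sample data are made up for illustration. It relies on the fact that VertexId is just a type alias for Long in GraphX, so there is no special object to create: a (Long, attribute) pair is already a valid vertex.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{Edge, Graph, VertexId}
    import org.apache.spark.rdd.RDD

    // Hypothetical helper object, not part of GraphX.
    object VertexIdSketch {
      // Builds a graph from whitespace-separated "src dst" lines, skipping '#' comment lines.
      def build(sc: SparkContext, lines: Seq[String]): Graph[Option[String], String] = {
        val arrayForm = sc.parallelize(lines)
          .filter(line => line.nonEmpty && line.charAt(0) != '#')
          .map(_.split("\\s+"))
          .cache()

        // Each vertex is a (VertexId, attribute) pair; the attribute is left empty here.
        val nodes: RDD[(VertexId, Option[String])] =
          arrayForm.flatMap(array => array).map(id => (id.toLong, Option.empty[String]))

        // Each edge carries an (empty) String attribute.
        val edges: RDD[Edge[String]] =
          arrayForm.map(line => Edge(line(0).toLong, line(1).toLong, ""))

        Graph(nodes, edges)
      }

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("VertexIdSketch").setMaster("local[*]")
        val sc = new SparkContext(conf)
        try {
          // Made-up sample data in the same "src dst" layout as the question's input.
          val graph = build(sc, Seq("# FromNodeId ToNodeId", "0 1", "0 2", "1 2"))
          println(s"vertices = ${graph.numVertices}, edges = ${graph.numEdges}")
        } finally {
          sc.stop()
        }
      }
    }
    ```

    In spark-shell the existing sc can be passed to build directly instead of creating a new context.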

    If you want to create a Graph without attributes you can use Graph.fromEdgeTuples.
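
    As a rough sketch of that alternative, assuming the same "src dst" input layout: Graph.fromEdgeTuples takes an RDD of (VertexId, VertexId) pairs plus a default vertex value, and returns a Graph whose edge attribute is an Int. The object name EdgeTuplesSketch, the build helper and the default value 1 are illustrative choices, not something from the question.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{Graph, VertexId}
    import org.apache.spark.rdd.RDD

    // Hypothetical helper object, not part of GraphX.
    object EdgeTuplesSketch {
      // Builds a graph directly from (src, dst) pairs; no vertex RDD is needed.
      def build(sc: SparkContext, lines: Seq[String]): Graph[Int, Int] = {
        val rawEdges: RDD[(VertexId, VertexId)] = sc.parallelize(lines)
          .filter(line => line.nonEmpty && line.charAt(0) != '#')
          .map(_.split("\\s+"))
          .map(parts => (parts(0).toLong, parts(1).toLong))

        // Every vertex mentioned in an edge gets the default value 1 as its attribute.
        Graph.fromEdgeTuples(rawEdges, 1)
      }

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("EdgeTuplesSketch").setMaster("local[*]")
        val sc = new SparkContext(conf)
        try {
          val graph = build(sc, Seq("# FromNodeId ToNodeId", "0 1", "0 2", "1 2"))
          println(s"vertices = ${graph.numVertices}, edges = ${graph.numEdges}")
        } finally {
          sc.stop()
        }
      }
    }
    ```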