
How do I use join/joinVertices to add a field to the triplet tuples of a graph in Spark GraphX?


I have an RDF graph with (s, p, o) tuples, and I built a property graph from it. The property graph is obtained with the following code:

val propGraph = Graph(vertexArray,edgeArray).cache()
propGraph.triplets.foreach(println(_))

with output as below:

((vId_src,src_att),(vId_dst,dst_att),property)

and RDF data as:

((0,<http://umkc.edu/xPropGraph#franklin>),(1,<http://umkc.edu/xPropGraph#rxin>),<http://umkc.edu/xPropGraph#advisor>)
((1,<http://umkc.edu/xPropGraph#rxin>),(2,<http://umkc.edu/xPropGraph#jgonzal>),<http://umkc.edu/xPropGraph#collab>)
((2147483648,<http://umkc.edu/xPropGraph#peter>),(4294967295,<http://umkc.edu/xPropGraph#John),<http://umkc.edu/xPropGraph#student>)
((6442450942,<http://umkc.edu/xPropGraph#istoica>),(0,<http://umkc.edu/xPropGraph#franklin>),<http://umkc.edu/xPropGraph#colleague>)
((0,<http://umkc.edu/xPropGraph#franklin>),(2,<http://umkc.edu/xPropGraph#jgonzal>),<http://umkc.edu/xPropGraph#pi>)

When I apply connectedComponents(), I get a cc graph whose vertex attributes are the component IDs (ccID), as below:

val cc = propGraph.connectedComponents().cache()
cc.triplets.foreach(println(_))

With output as:

((0,0),(2,0),<http://umkc.edu/xPropGraph#pi>)
((0,0),(1,0),<http://umkc.edu/xPropGraph#advisor>)
((1,0),(2,0),<http://umkc.edu/xPropGraph#collab>)
((2147483648,2147483648),(4294967295,2147483648),<http://umkc.edu/xPropGraph#student>)
((6442450942,0),(0,0),<http://umkc.edu/xPropGraph#colleague>)

I need to get something like:

((vId_src,src_att),(vId_dst,dst_att),property, ccID)

i.e. I need the result in this triplet/graph format:

((0,<http://umkc.edu/xPropGraph#franklin>),(2,<http://umkc.edu/xPropGraph#jgonzal>),<http://umkc.edu/xPropGraph#pi>,0)
((6442450942,<http://umkc.edu/xPropGraph#istoica>),(0,<http://umkc.edu/xPropGraph#franklin>),<http://umkc.edu/xPropGraph#colleague>,0)
((0,<http://umkc.edu/xPropGraph#franklin>),(1,<http://umkc.edu/xPropGraph#rxin>),<http://umkc.edu/xPropGraph#advisor>,0)
((1,<http://umkc.edu/xPropGraph#rxin>),(2,<http://umkc.edu/xPropGraph#jgonzal>),<http://umkc.edu/xPropGraph#collab>,0)
((2147483648,<http://umkc.edu/xPropGraph#peter>),(4294967295,<http://umkc.edu/xPropGraph#John),<http://umkc.edu/xPropGraph#student>,2147483648)

So my best option seems to be a join. I tried something like val triplets = propGraph.joinVertices(cc.vertices), but could not get it to work. Is there any way to get this result?

Any help is appreciated! I am a newbie in GraphX. :)


Solution

  • Since I was looking for the format ((vId_src,src_att),(vId_dst,dst_att),property, ccID), I used zip() on two RDDs:

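    import org.apache.spark.graphx.{Graph, VertexId}
    import org.apache.spark.rdd.RDD

    // Each vertex attribute of cc is its component ID (the lowest vertex ID in the component).
    val cc: Graph[VertexId, String] = propGraph.connectedComponents().cache()
    println("###GRAPH WITH CONNECTED COMPONENTS ###")
    cc.triplets.foreach(println(_))
    println("###VERTICES OF CONNECTED COMPONENTS GRAPH ###")
    cc.vertices.foreach(println(_))
    println("###EDGES OF CONNECTED COMPONENTS GRAPH  ###")
    cc.edges.foreach(println(_))

    /** Alternative to a join: zip the formatted triplets with their component IDs. */
    println("###STEP-2 GETTING ONE MERGED RDD OF NEW GRAPH###")
    // One string per triplet; the leading "(" appears only when the zipped pair is printed.
    val newGraph: RDD[String] = propGraph.triplets.map(t =>
      t.srcId + "," + t.srcAttr + "),(" + t.dstId + "," + t.dstAttr + ")," + t.attr)
    // Source and destination lie in the same component, so srcAttr is the triplet's ccID.
    val ccID: RDD[String] = cc.triplets.map(t => t.srcAttr.toString)
    // zip assumes both RDDs have the same number of partitions and elements per partition.
    val newPropGraph: RDD[(String, String)] = newGraph.zip(ccID)
    newPropGraph.collect.foreach(println(_))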
    

    After doing so, I got the following output:

    (4294967296,<http://umkc.edu/xPropGraph#node1>),(2147483649,<http://umkc.edu/xPropGraph#node2>),<http://umkc.edu/xPropGraph#prop1>,0)
    (2147483649,<http://umkc.edu/xPropGraph#node2>),(6442450942,<http://umkc.edu/xPropGraph#node4>),<http://umkc.edu/xPropGraph#prop5>,0)
    (4294967295,<http://umkc.edu/xPropGraph#node5>),(2147483648,<http://umkc.edu/xPropGraph#node6>),<http://umkc.edu/xPropGraph#prop3>,2147483648)
    (0,<http://umkc.edu/xPropGraph#node3>),(6442450942,<http://umkc.edu/xPropGraph#node4>),<http://umkc.edu/xPropGraph#prop2>,0)
    (2147483649,<http://umkc.edu/xPropGraph#node2>),(0,<http://umkc.edu/xPropGraph#node3>),<http://umkc.edu/xPropGraph#prop4>,0)
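
    For anyone who wants the join the question originally asked about: joinVertices keeps the vertex attribute type unchanged, so it cannot turn a String attribute into a (String, ccID) pair, whereas outerJoinVertices can. Below is a minimal sketch of that approach, not the code I actually ran. The names CcJoinSketch and tripletsWithCc and the small sample graph are hypothetical, and it assumes the vertex and edge attributes are Strings, as in the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object CcJoinSketch {
  // Hypothetical helper: formats every triplet as
  // ((vId_src,src_att),(vId_dst,dst_att),property,ccID).
  def tripletsWithCc(propGraph: Graph[String, String]): RDD[String] = {
    val cc = propGraph.connectedComponents()
    // outerJoinVertices may change the vertex attribute type
    // (String -> (String, VertexId)); joinVertices cannot.
    val withCc: Graph[(String, VertexId), String] =
      propGraph.outerJoinVertices(cc.vertices) {
        (id, attr, ccOpt) => (attr, ccOpt.getOrElse(id))
      }
    // Both endpoints of an edge share a component, so the source's ccID is used.
    withCc.triplets.map { t =>
      s"((${t.srcId},${t.srcAttr._1}),(${t.dstId},${t.dstAttr._1}),${t.attr},${t.srcAttr._2})"
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cc-join-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Hypothetical sample data modelled on the triples in the question.
    val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq(
      (0L, "<http://umkc.edu/xPropGraph#franklin>"),
      (1L, "<http://umkc.edu/xPropGraph#rxin>"),
      (2L, "<http://umkc.edu/xPropGraph#jgonzal>")))
    val edges: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(0L, 1L, "<http://umkc.edu/xPropGraph#advisor>"),
      Edge(1L, 2L, "<http://umkc.edu/xPropGraph#collab>")))
    tripletsWithCc(Graph(vertices, edges)).collect().foreach(println)
    sc.stop()
  }
}
```

    Because the component ID travels with each vertex through the join, this version does not depend on the two triplet RDDs being partitioned and ordered identically, which zip() requires.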