spark graphframes stateful motif

GraphFrames has a nice example of stateful motifs. How can I explicitly return the counts? As you can see, the output only contains the vertices and edges, not the counts.

How can I modify it so that the state update has access not only to the edges but also to the attributes of the vertices (for example, age)?

    when(relationship === "friend", cnt + 1).otherwise(cnt)

That is, how could I extend the count to compute

  • for each vertex, the number of friends with age > 30
  • the percentage friendsGreater30 / allFriends

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{col, lit, when}
    import org.graphframes.examples

    val g = examples.Graphs.friends  // get example graph
    // Find chains of 4 vertices.
    val chain4 = g.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d)")
    // Query on sequence, with state (cnt)
    //  (a) Define method for updating state given the next element of the motif.
    def sumFriends(cnt: Column, relationship: Column): Column = {
      when(relationship === "friend", cnt + 1).otherwise(cnt)
    }
    //  (b) Use sequence operation to apply method to sequence of elements in motif.
    //      In this case, the elements are the 3 edges.
    val condition = Seq("ab", "bc", "cd").
      foldLeft(lit(0))((cnt, e) => sumFriends(cnt, col(e)("relationship")))
    //  (c) Apply filter to DataFrame.
    val chainWith2Friends2 = chain4.where(condition >= 2)


This outputs:

    |            a|          ab|            b|          bc|            c|          cd|             d|
    |[e,Esther,32]|[e,d,friend]| [d,David,29]|[d,a,friend]| [a,Alice,34]|[a,e,friend]| [e,Esther,32]|
    |[e,Esther,32]|[e,d,friend]| [d,David,29]|[d,a,friend]| [a,Alice,34]|[a,b,friend]|    [b,Bob,36]|
    | [d,David,29]|[d,a,friend]| [a,Alice,34]|[a,e,friend]|[e,Esther,32]|[e,d,friend]|  [d,David,29]|
    | [d,David,29]|[d,a,friend]| [a,Alice,34]|[a,e,friend]|[e,Esther,32]|[e,f,follow]|  [f,Fanny,36]|
    | [d,David,29]|[d,a,friend]| [a,Alice,34]|[a,b,friend]|   [b,Bob,36]|[b,c,follow]|[c,Charlie,30]|
    | [a,Alice,34]|[a,e,friend]|[e,Esther,32]|[e,d,friend]| [d,David,29]|[d,a,friend]|  [a,Alice,34]|


  • Note that sumFriends returns a Column, so condition is a Column as well. That is why you can pass it to where directly as an expression, rather than as a quoted column name. So all you have to do is add that column to your DataFrame. After running the above code, I can run

     | 1|
     | 0|
     | 0|
     | 0|
     | 0|
     | 3|
     | 3|
     | 3|
     | 2|
     | 2|
     | 3|
     | 1|
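One way to add the column is sketched below. It assumes a spark-shell session with GraphFrames on the classpath and the example graph from the question. The column name friendCount and the value name chain4WithCount are made up for illustration; withColumn, select and where are standard DataFrame methods.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, when}
import org.graphframes.examples

// Same setup as in the question.
val g = examples.Graphs.friends
val chain4 = g.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d)")

def sumFriends(cnt: Column, relationship: Column): Column =
  when(relationship === "friend", cnt + 1).otherwise(cnt)

val condition = Seq("ab", "bc", "cd").
  foldLeft(lit(0))((cnt, e) => sumFriends(cnt, col(e)("relationship")))

// Attach the count as a regular column ("friendCount" is an illustrative name).
val chain4WithCount = chain4.withColumn("friendCount", condition)
chain4WithCount.select("friendCount").show()

// Filtering on the named column selects the same rows as where(condition >= 2).
chain4WithCount.where(col("friendCount") >= 2).show()
```

Once the count is a named column, it can be displayed, filtered on, or aggregated like any other column.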

    you could also use

    Hope this helps
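For the second part of the question (using vertex attributes in the count), the same fold can read fields of the vertex structs, for example col("b")("age"). The following is only a sketch under assumptions. It interprets "friends with age > 30" as "friend" edges whose destination vertex is older than 30, and it counts per chain of the motif rather than per vertex. The names hops, allFriends, friendsGreater30, ratio and withCounts are illustrative.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, when}
import org.graphframes.examples

val g = examples.Graphs.friends
val chain4 = g.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d)")

// Each hop pairs an edge with the vertex it points to.
val hops = Seq("ab" -> "b", "bc" -> "c", "cd" -> "d")

// Number of "friend" edges along the chain.
val allFriends: Column = hops.foldLeft(lit(0)) { case (cnt, (e, _)) =>
  when(col(e)("relationship") === "friend", cnt + 1).otherwise(cnt)
}

// Number of "friend" edges whose destination vertex is older than 30.
val friendsGreater30: Column = hops.foldLeft(lit(0)) { case (cnt, (e, v)) =>
  when(col(e)("relationship") === "friend" && col(v)("age") > 30, cnt + 1).otherwise(cnt)
}

val withCounts = chain4.
  withColumn("allFriends", allFriends).
  withColumn("friendsGreater30", friendsGreater30).
  // Guard against chains without any "friend" edge; the ratio is null there.
  withColumn("ratio",
    when(col("allFriends") > 0, col("friendsGreater30") / col("allFriends")))

withCounts.select("a.id", "allFriends", "friendsGreater30", "ratio").show()
```

A per-vertex figure would additionally need an aggregation (for instance a groupBy on the source vertex id), which this sketch does not attempt.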