scala apache-spark spark-graphx

Converting an array of edges and vertices to a graph-friendly format


I have extracted the links between Wikipedia pages into an RDD with the following format:

Array[(String, String)] = Array((AccessibleComputing,[Computer accessibility]), 
                      (Anarchism,[political philosophy, stateless society]))

Here the first string is a page (vertex) and the second is a bracketed, comma-separated list of links (edges) pointing to other Wikipedia pages.

How can I convert it into a graph-friendly format like this:

Array(
(AccessibleComputing,Computer accessibility),
(Anarchism,stateless society),
(Anarchism,political philosophy)
)

so that the source page (vertex) is repeated once for each of its links (edges)?


Solution

  • One option: drop the surrounding brackets, split on ", ", and flatMap:

    data.flatMap{case (k, v) => v.drop(1).dropRight(1).split(", ").map((k, _))}
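
For anyone who wants to try the idea without a Spark cluster, here is a minimal sketch of the same transformation on a plain Scala `Seq` standing in for the RDD. The names `EdgeList` and `toEdges` are hypothetical, not part of Spark or GraphX. It assumes every link string is wrapped in exactly one pair of square brackets and that links are separated by `", "`, as in the sample data; a link title that itself contains `", "` would be split incorrectly.

```scala
object EdgeList {
  // Hypothetical helper: turn one (page, "[a, b, c]") record into (page, link) pairs.
  def toEdges(page: String, links: String): Seq[(String, String)] =
    links.stripPrefix("[").stripSuffix("]") // remove the enclosing brackets
      .split(", ")                          // one element per link
      .filter(_.nonEmpty)                   // "[]" produces no edges rather than one empty target
      .map(target => (page, target))        // repeat the source page for every link
      .toSeq

  def main(args: Array[String]): Unit = {
    // Sample data copied from the question.
    val data: Seq[(String, String)] = Seq(
      ("AccessibleComputing", "[Computer accessibility]"),
      ("Anarchism", "[political philosophy, stateless society]")
    )

    val edges = data.flatMap { case (page, links) => toEdges(page, links) }
    edges.foreach(println)
    // (AccessibleComputing,Computer accessibility)
    // (Anarchism,political philosophy)
    // (Anarchism,stateless society)
  }
}
```

Unlike `drop(1).dropRight(1)`, `stripPrefix`/`stripSuffix` only remove the brackets when they are actually present, and the `filter` step avoids emitting an edge with an empty target for a page whose list is `[]`. The same function should work on the RDD itself, for example `data.flatMap { case (k, v) => EdgeList.toEdges(k, v) }`.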