Tags: apache-spark, duplicates, graphstream

Removing bi-directional unique rows from text file


I have a text file as follows:

1    3
2    5
3    6
4    5
5    4
6    1
7    2

The above file represents the edges in an undirected graph. I want to remove the duplicate edges in the graph. In the example above, I want to remove either (4, 5) or (5, 4), as they represent the same edge in the graph and hence cause duplication. I am trying to visualize the graph from the file with GraphStream, using the GraphX library in Apache Spark. But due to the presence of duplicate edges as explained above, it gives an error as follows:

org.graphstream.graph.EdgeRejectedException: Edge 4[5--4] was rejected by node 5

What would be the best way to remove such duplicates from the text file?


Solution

  • You can use the convertToCanonicalEdges method from GraphOps. It:

    • converts bi-directional edges into uni-directional ones;
    • rewrites the vertex ids of edges so that srcIds are smaller than dstIds, and merges the duplicated edges.

    In your case:

    import org.apache.spark.graphx.Graph

    // assumes an existing SparkContext `sc` (e.g. in spark-shell)
    val graph = Graph.fromEdgeTuples(sc.parallelize(
      Seq((1, 3), (2, 5), (3, 6), (4, 5), (5, 4), (6, 1), (7, 2))), -1)

    graph.convertToCanonicalEdges().edges.collect.foreach(println)
    

    with result:

    Edge(3,6,1)
    Edge(1,6,1)
    Edge(1,3,1)
    Edge(2,5,1)
    Edge(2,7,1)
    Edge(4,5,1)
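
    If you only want to clean the text file itself, without building a GraphX graph, the same canonicalization idea can be applied directly: order each pair so the smaller id comes first, then deduplicate. A minimal sketch in plain Scala (no Spark), with the file's lines stood in for by a hard-coded Seq:

    ```scala
    // Stand-in for the lines read from the text file
    // (in practice: scala.io.Source.fromFile("edges.txt").getLines().toSeq)
    val lines = Seq("1 3", "2 5", "3 6", "4 5", "5 4", "6 1", "7 2")

    val unique = lines
      .map { line =>
        val Array(a, b) = line.trim.split("\\s+").map(_.toInt)
        // represent the undirected edge with the smaller id first
        if (a <= b) (a, b) else (b, a)
      }
      .distinct // (5, 4) becomes (4, 5) and collapses into the earlier entry

    unique.foreach { case (a, b) => println(s"$a\t$b") }
    ```

    This keeps the first occurrence of each undirected edge, so (5, 4) is dropped in favour of (4, 5), matching what convertToCanonicalEdges does on the GraphX side.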