Apache Spark GraphX: Source and destination share the same VertexId but represent different things


I have a file with srcId -> dstId values that represent the edges of a graph, which I load with GraphLoader.edgeListFile. The source represents users and the destination represents items. In some cases the srcId and the dstId are equal, which causes errors in some algorithms, for example when I want to collect the neighbors of each vertex. Can I do something to separate the users from the items without losing any information?


Solution

  • Each GraphX vertex must be identified by a unique Long value (its VertexId). If the source and destination IDs represent different things, you need to transform them with some operation to make sure they are distinct. For example, assuming you have read your data into an RDD[(Long, Long)] named rdd, you could do:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.graphx.{Edge, Graph}
    
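    // Largest user ID; item IDs are shifted by it (this assumes item IDs start at 1).
    val userMaxID: Long = rdd.map(_._1).max()
    // The edge attribute 0 is only a placeholder; the endpoints are what matter here.
    val edges: RDD[Edge[Int]] = rdd.map {
      case (userID, itemID) => Edge(userID, itemID + userMaxID, 0)
    }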
    
    val g = Graph.fromEdges(edges, 0)
    

    Then you will have a graph where every item's vertex ID is its original ID plus the maximum user ID. If item IDs can be 0, add an extra 1 to the offset; otherwise item 0 would get the same vertex ID as the user with the maximum ID.

    Note that this is just a suggestion; the idea is to transform your IDs so that no item can have the same ID as a user. You may also want to keep a way to tell whether a given vertex is a user or an item: in my suggestion, all vertices with ID <= userMaxID are users, whereas all vertices with ID > userMaxID are items.
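
    As a rough sketch of that last point, a small helper could keep the encoding and decoding in one place. It assumes all IDs are non-negative and always adds the extra 1, so item 0 is safe too. IdSpace and its method names are made up for this illustration; they are not part of GraphX:

    ```scala
    // Hypothetical helper (not a GraphX class): maps item IDs into a range
    // strictly above every user ID. Assumes all IDs are non-negative.
    final case class IdSpace(userMaxID: Long) {
      // The +1 keeps item 0 from landing on the largest user ID.
      private val offset: Long = userMaxID + 1

      // Vertex ID to use for an item when building edges.
      def itemVertexId(itemID: Long): Long = itemID + offset

      // Users keep their original IDs, so they all sit at or below userMaxID.
      def isUser(vertexId: Long): Boolean = vertexId <= userMaxID

      // Recover the item ID from the input file for an item vertex.
      def originalItemId(vertexId: Long): Long = {
        require(!isUser(vertexId), s"vertex $vertexId is a user, not an item")
        vertexId - offset
      }
    }
    ```

    If the edges are built with ids.itemVertexId(itemID) instead of itemID + userMaxID, where ids = IdSpace(userMaxID), you could then tag every vertex with something like g.mapVertices((id, _) => ids.isUser(id)) and filter neighbors by that flag.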