Tags: eclipse, scala, apache-spark, rdd, spark-shell

Pair RDD tuple comparison


I am learning how to use Spark and Scala, and I am trying to write a Scala Spark program that receives an input of string values such as:

12 13
13 14
13 12
15 16
16 17
17 16

I initially create my pair rdd with:

val myRdd = sc.textFile(args(0)).map(line => (line.split("\\s+")(0), line.split("\\s+")(1))).distinct()

Now this is where I am getting stuck. In the set of values there are instances like (12,13) and (13,12). In the context of the data, these two are the same instance. Simply put, (a,b) = (b,a).

I need to create an RDD that has one or the other, but not both. So the result, once this is done, would look something like this:

12 13
13 14
15 16
16 17

The only way I can see to do it right now is to take each tuple and compare it with the rest of the RDD to make sure it isn't the same data, just swapped.


Solution

  • The numbers just need to be sorted before creating a tuple.

    val myRdd = sc.textFile(args(0))
      .map(line => {
        // Sort the two values so (13,12) and (12,13) both become ("12","13"),
        // then distinct collapses the duplicates into a single pair.
        val nums = line.split("\\s+").sorted
        (nums(0), nums(1))
      }).distinct