Tags: eclipse, scala, apache-spark, rdd, spark-shell

Pair RDD tuple comparison


I am learning how to use Spark and Scala, and I am trying to write a Scala Spark program that receives an input of string values such as:

12 13
13 14
13 12
15 16
16 17
17 16

I initially create my pair rdd with:

val myRdd = sc.textFile(args(0)).map(line => (line.split("\\s+")(0), line.split("\\s+")(1))).distinct()

Now this is where I am getting stuck. In the set of values there are instances like (12,13) and (13,12). In the context of the data, these two are the same instance. Simply put, (a,b) = (b,a).

I need to create an RDD that has one or the other, but not both. So the result, once this is done, would look something like this:

12 13
13 14
15 16
16 17

The only way I can see to do it right now is to take each tuple and compare it with the rest of the RDD to make sure it isn't the same data, just swapped.


Solution

  • The numbers just need to be sorted before creating a tuple.

    val myRdd = sc.textFile(args(0))
      .map(line => {
        // Sort the two values so (13,12) and (12,13) both become ("12","13"),
        // then distinct collapses the duplicates into a single pair.
        val nums = line.split("\\s+").sorted
        (nums(0), nums(1))
      }).distinct