Tags: scala, apache-spark, transformation, rdd

How to get distinct records from a Spark RDD by key?


I have an RDD whose records look like this:

key1  value1
key1  value2
key2  value3
key3  value4
key3  value5

I want an RDD that keeps only one record per distinct key, as follows:

key1  value1
key2  value3
key3  value4

I can only use the spark-core APIs, and I don't want to aggregate the values that share a key.


Solution

  • You could do this with PairRDDFunctions.reduceByKey, using a function that picks one of the two values instead of combining them. Assuming you have an RDD[(K, V)]:

    rdd.reduceByKey((a, b) => if (someCondition) a else b)
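
As a minimal sketch of the idea above, here is one way to fill in someCondition for the sample data in the question. The object name DistinctByKey, the helper keepSmallestPerKey, the local master setting, and the choice of "keep the lexicographically smaller value" are all assumptions made for illustration, not part of the original answer. A deterministic rule is used because reduceByKey does not guarantee the order in which values for a key are combined, so something like (a, _) => a may not always keep the record that appeared first.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DistinctByKey {

  // Hypothetical helper: keeps exactly one value per key.
  // The rule "keep the smaller string" is an illustrative stand-in for someCondition.
  def keepSmallestPerKey(rdd: RDD[(String, String)]): RDD[(String, String)] =
    rdd.reduceByKey((a, b) => if (a.compareTo(b) <= 0) a else b)

  def main(args: Array[String]): Unit = {
    // Local mode and the app name are assumptions for a self-contained run.
    val conf = new SparkConf().setMaster("local[*]").setAppName("distinct-by-key")
    val sc = new SparkContext(conf)
    try {
      // Sample records from the question.
      val rdd = sc.parallelize(Seq(
        "key1" -> "value1",
        "key1" -> "value2",
        "key2" -> "value3",
        "key3" -> "value4",
        "key3" -> "value5"
      ))

      // Sort on the driver only to make the printed order stable.
      keepSmallestPerKey(rdd).collect().sortBy(_._1).foreach(println)
      // prints:
      // (key1,value1)
      // (key2,value3)
      // (key3,value4)
    } finally {
      sc.stop()
    }
  }
}
```

Because the reducing function returns one of its two inputs unchanged, no values are merged; the result has one untouched record per key. If any record per key is acceptable, the condition can be simplified, at the cost of the result possibly varying between runs.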