
Spark JavaRDD vs JavaPairRDD?


I am new to Spark and I'm trying to understand the difference between JavaRDD and JavaPairRDD, and also how expensive it is to convert a JavaRDD into a JavaPairRDD:

JavaRDD<Tuple2<String, String>> myRdd; // This is my JavaRDD

JavaPairRDD<String, String> pairRDD = JavaPairRDD.fromJavaRDD(myRdd);

Solution

  • There is a distinction because some operations (aggregateByKey, groupByKey, etc.) need a key to group by and a value to put into the grouped result. JavaPairRDD exists to declare that contract to the developer: a key and a value are required.

    A regular JavaRDD can be used for operations that don't require an explicit key field; these are generic operations on arbitrary element types.

    Take a look at their javadocs to see the functions that are available for each:

    JavaRDD

    JavaPairRDD

    Also, converting one to the other should be fast. It is a narrow transformation, because each row is converted to another row in place and no data needs to be sent across the network. In general, your performance will be determined mostly by the wide transformations, where data must be sent between nodes to colocate rows with the same key on the same worker.
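    To make the distinction concrete, here is a minimal sketch (class and variable names are illustrative, and it assumes a local Spark setup) that converts a JavaRDD of tuples into a JavaPairRDD with fromJavaRDD and then applies reduceByKey, a key-based operation that only JavaPairRDD exposes:

    ```java
    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class PairRddDemo {
        public static void main(String[] args) {
            // Local context for illustration only
            SparkConf conf = new SparkConf().setAppName("pair-rdd-demo").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // A JavaRDD of tuples: the elements happen to be pairs,
            // but Spark sees them as arbitrary objects with no key/value contract
            JavaRDD<Tuple2<String, Integer>> tupleRdd = sc.parallelize(Arrays.asList(
                    new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("a", 3)));

            // Narrow conversion: each element is reinterpreted as a (key, value) pair;
            // no shuffle happens here
            JavaPairRDD<String, Integer> pairRdd = JavaPairRDD.fromJavaRDD(tupleRdd);

            // reduceByKey is only available on JavaPairRDD; this is the wide
            // transformation that actually moves data between workers
            JavaPairRDD<String, Integer> sums = pairRdd.reduceByKey(Integer::sum);

            sums.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
            sc.stop();
        }
    }
    ```

    If the original RDD's elements are not already Tuple2 instances, mapToPair is the usual route instead of fromJavaRDD: it takes a function that extracts a key and a value from each element and returns a JavaPairRDD directly.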