Combining two JavaPairRDDs for a subsequent reduceByKey job

I am trying to combine two JavaPairRDDs so that I can run reduceByKey on the combined dataset, like below:

JavaPairRDD data1 = ...

JavaPairRDD data2 = ...

I want a new dataset that contains the elements of both data1 and data2, something like:

JavaPairRDD data_total = (data1 + data2)

So that I can do a reduce by key on the combined dataset:

JavaPairRDD output = data_total.reduceByKey(... my reduce function ...);

What's the best way to combine data1 and data2? Or what's the best approach to this problem?

Thanks a lot!

Solution

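You can use union. It returns a new JavaPairRDD containing the elements of both RDDs. Duplicates are kept, and both RDDs must have the same key and value types:

// Return the union of this RDD and another one.
JavaPairRDD<K,V> union(JavaPairRDD<K,V> other)

You can then call reduceByKey on the result, i.e. data1.union(data2).reduceByKey(... your reduce function ...).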
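As a rough sketch of how union and reduceByKey fit together, here is a small local example. The String/Integer types, the sample pairs, the summing reduce function, and the class and method names (UnionThenReduce, runLocal) are all assumptions made for illustration; substitute your own types and reduce function. It assumes Spark's Java API is on the classpath.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class UnionThenReduce {

    // Hypothetical helper: builds two small pair RDDs, unions them,
    // and reduces by key. Returns the result as a sorted map.
    static Map<String, Integer> runLocal() {
        SparkConf conf = new SparkConf()
                .setAppName("UnionThenReduce")
                .setMaster("local[*]"); // local mode, for illustration only

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Sample data (assumed): stands in for the asker's data1 and data2.
            JavaPairRDD<String, Integer> data1 = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<String, Integer>("a", 1),
                    new Tuple2<String, Integer>("b", 2)));
            JavaPairRDD<String, Integer> data2 = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<String, Integer>("a", 3),
                    new Tuple2<String, Integer>("c", 4)));

            // union keeps every pair from both RDDs (no de-duplication).
            JavaPairRDD<String, Integer> dataTotal = data1.union(data2);

            // Reduce across the combined dataset; summing is just an example.
            JavaPairRDD<String, Integer> output =
                    dataTotal.reduceByKey((x, y) -> x + y);

            // Copy into a TreeMap so the result has a stable key order.
            return new TreeMap<>(output.collectAsMap());
        }
    }

    public static void main(String[] args) {
        System.out.println(runLocal()); // prints {a=4, b=2, c=4}
    }
}
```

Because union does not drop duplicate pairs, every value for a given key from either input reaches the reduce function, which is what the reduceByKey step needs here.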
