Search code examples
javaapache-sparkjoin

Join 2 DataSet<Row> in java spark to merge into single DataSet<Row>


I have 2 DataSet<Row>

enter image description here

enter image description here

I want the resultant to be a DataSet which is combined of both Resultant DataSet<Row>

enter image description here

using Java Spark

I have tried to do simple join

dataset1.join(dataset2).show();

but I am getting an error

Detected implicit cartesian product for INNER join between logical plans

Other way I tied to do as full outer Join

dataset1.join(dataset2,dataset1.col("A").equalTo(dataset2.col("B")),"outer").show();

but it is coming with null entries

enter image description here

But I want resultant to be something like this one

enter image description here

Can someone suggest me other way of doing it


Solution

  • For join rows with the same positions, position can be added by Window row_number function to each dataset, and outer join performed by this column. On Scala, guess, can be converted to Java easily:

    val first = Seq(
      ("as", 7),
      ("bs", 5),
      ("cs", 3),
      ("ds", 1)
    ).toDF("A", "B")
    
    val second = Seq(
      ("rs", 1, 3, 5),
      ("rd", 3, 6, 8)
    ).toDF("C", "D", "E", "F")
    
    val naturalOrderWindow = Window.orderBy(lit(1))
    val result = first
      .withColumn("position", row_number().over(naturalOrderWindow))
      .join(second.withColumn("position", row_number().over(naturalOrderWindow)), Seq("position"), "outer")
      .drop("position")
    

    Result is:

    +---+---+----+----+----+----+
    |A  |B  |C   |D   |E   |F   |
    +---+---+----+----+----+----+
    |as |7  |rs  |1   |3   |5   |
    |bs |5  |rd  |3   |6   |8   |
    |cs |3  |null|null|null|null|
    |ds |1  |null|null|null|null|
    +---+---+----+----+----+----+
    

    Note: Window without partition is used, performance can be bad.