I have 2 DataSet<Row>
I want the resultant to be a DataSet which is combined of both Resultant DataSet<Row>
using Java Spark
I have tried to do simple join
dataset1.join(dataset2).show();
but I am getting an error
Detected implicit cartesian product for INNER join between logical plans
Other way I tied to do as full outer Join
dataset1.join(dataset2,dataset1.col("A").equalTo(dataset2.col("B")),"outer").show();
but it is coming with null entries
But I want resultant to be something like this one
Can someone suggest me other way of doing it
For join rows with the same positions, position can be added by Window row_number
function to each dataset, and outer join performed by this column. On Scala, guess, can be converted to Java easily:
val first = Seq(
("as", 7),
("bs", 5),
("cs", 3),
("ds", 1)
).toDF("A", "B")
val second = Seq(
("rs", 1, 3, 5),
("rd", 3, 6, 8)
).toDF("C", "D", "E", "F")
val naturalOrderWindow = Window.orderBy(lit(1))
val result = first
.withColumn("position", row_number().over(naturalOrderWindow))
.join(second.withColumn("position", row_number().over(naturalOrderWindow)), Seq("position"), "outer")
.drop("position")
Result is:
+---+---+----+----+----+----+
|A |B |C |D |E |F |
+---+---+----+----+----+----+
|as |7 |rs |1 |3 |5 |
|bs |5 |rd |3 |6 |8 |
|cs |3 |null|null|null|null|
|ds |1 |null|null|null|null|
+---+---+----+----+----+----+
Note: Window without partition is used, performance can be bad.