Tags: dataframe, scala, apache-spark

Spark dataset join as tuple of case classes


I am joining two datasets where some of their columns share the same name. I would like the output to be tuples of two case classes, each representing their respective dataset.

val joined = dataset1.as("ds1")
  .join(dataset2.as("ds2"), dataset1("key") === dataset2("key"), "inner")
  // select doesn't work because of the columns which share the same name
  .select("ds1.*", "ds2.*")
  // skipping the select and going straight here doesn't work because of the same problem
  .as[(Caseclass1, Caseclass2)]

What code is needed to let Spark know to map ds1.* to type Caseclass1 and ds2.* to Caseclass2?


Solution

  • You can leverage the struct function here as follows:

    // import the struct function
    import org.apache.spark.sql.functions.struct

    // create a wrapper case class whose field names match the struct column aliases
    case class Outer(caseclass1: Caseclass1, caseclass2: Caseclass2)

    // join, wrap each side's columns in a struct, and map the result to the wrapper
    val joined = dataset1.as("ds1")
      .join(dataset2.as("ds2"), dataset1("key") === dataset2("key"), "inner")
      .select(struct("ds1.*").as("caseclass1"), struct("ds2.*").as("caseclass2"))
      .as[Outer]
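
    For context, here is a minimal end-to-end sketch of the same approach. The case classes, column names, and sample rows are hypothetical, added only to make the snippet runnable; the join itself follows the answer above, and the final map shows one way to get a plain tuple instead of the wrapper.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.struct

    // Hypothetical case classes standing in for the two datasets' schemas
    case class Caseclass1(key: Int, name: String)
    case class Caseclass2(key: Int, amount: Double)
    case class Outer(caseclass1: Caseclass1, caseclass2: Caseclass2)

    object JoinAsTupleExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("join-as-tuple").getOrCreate()
        import spark.implicits._

        // Hypothetical sample data so the snippet runs end to end
        val dataset1 = Seq(Caseclass1(1, "a"), Caseclass1(2, "b")).toDS()
        val dataset2 = Seq(Caseclass2(1, 10.0), Caseclass2(2, 20.0)).toDS()

        // Same pattern as the answer: wrap each side in a struct named after the wrapper's fields
        val joined = dataset1.as("ds1")
          .join(dataset2.as("ds2"), dataset1("key") === dataset2("key"), "inner")
          .select(struct("ds1.*").as("caseclass1"), struct("ds2.*").as("caseclass2"))
          .as[Outer]

        // If a plain tuple is preferred over the wrapper, map afterwards
        val asTuple = joined.map(o => (o.caseclass1, o.caseclass2))
        asTuple.show(truncate = false)

        spark.stop()
      }
    }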