Join multiple DataFrames in Scala


I have two variables. One is a DataFrame and the other is a List[DataFrame]. I want to join the single DataFrame with every DataFrame in the list. At the moment I am using the following approach:

def joinDfList(SingleDataFrame: DataFrame, DataFrameList: List[DataFrame], groupByCols: List[String]): DataFrame = {

    var joinedDf = SingleDataFrame
    DataFrameList.foreach(
      Df => {
        joinedDf = joinedDf.join(Df, groupByCols, "left_outer")
      }
    )
    joinedDf.na.fill(0.0)
}

Is there an approach that avoids the "var" and uses "foldLeft" instead of "foreach"?


Solution

  • You can simply write it without vars using foldLeft:

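    import org.apache.spark.sql.DataFrame

    // Start from singleDataFrame and left-join each DataFrame in the list onto the accumulator.
    def joinDfList(singleDataFrame: DataFrame, dataFrameList: List[DataFrame], groupByCols: List[String]): DataFrame =
      dataFrameList.foldLeft(singleDataFrame)(
        (dfAcc, nextDF) => dfAcc.join(nextDF, groupByCols, "left_outer")
      ).na.fill(0.0) // replace nulls in numeric columns with 0.0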
    

    In this code dfAcc is the accumulator: at each step it is joined with the next DataFrame from dataFrameList, so at the end you get a single DataFrame.

    Important: be careful. Chaining too many joins in one job can degrade performance, since each join can add a shuffle and makes the query plan larger.
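
To try the foldLeft version end to end, here is a minimal sketch. It assumes a local SparkSession and three small made-up DataFrames (base, clicks, views) that are not part of the original question; the object name and the shortened parameter names are also only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinDfListExample {
  // Same fold as in the answer: the accumulator starts as `base` and is
  // left-joined with each DataFrame in `others`, in list order.
  def joinDfList(base: DataFrame, others: List[DataFrame], keyCols: List[String]): DataFrame =
    others
      .foldLeft(base)((acc, next) => acc.join(next, keyCols, "left_outer"))
      .na.fill(0.0) // nulls produced by unmatched rows become 0.0 in numeric columns

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this toy data.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-fold-example")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; every DataFrame shares the key column "id".
    val base   = Seq(1, 2, 3).toDF("id")
    val clicks = Seq((1, 10.0), (2, 20.0)).toDF("id", "clicks")
    val views  = Seq((1, 100.0), (3, 300.0)).toDF("id", "views")

    val joined = joinDfList(base, List(clicks, views), List("id"))
    joined.orderBy("id").show()

    spark.stop()
  }
}
```

In this data, id 2 has no row in views and id 3 has no row in clicks, so the left joins produce nulls there and na.fill(0.0) turns them into 0.0. With an empty list, foldLeft simply returns the starting DataFrame (after the fill), so no special case is needed.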