dataframe scala apache-spark drop

Spark Scala: how to remove the columns that are not common to 2 dataframes


I have 2 dataframes: the first has 53 columns and the second has 132 columns. I want to compare them, remove every column that is not common to both, and then display each dataframe with only the common columns.

So far I have built a collection of all the columns that don't match, but I don't know how to drop them.

    val diffColumns = df2.columns.toSet.diff(df1.columns.toSet).union(df1.columns.toSet.diff(df2.columns.toSet))

This gives me a scala.collection.immutable.Set[String]. Now I'd like to use it to drop these columns from each dataframe. I tried something like this, but it does not work:

    val newDF1 = df1.drop(diffColumns)

Solution

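  • The .drop function accepts column names as varargs (String*), not a Set, so you need to convert the set to a Seq and "expand" it using the : _* syntax. Because .drop ignores names that a dataframe does not contain, the same set works for both dataframes:

    val newDF1 = df1.drop(diffColumns.toSeq: _*)
    val newDF2 = df2.drop(diffColumns.toSeq: _*)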
    

    Alternatively, instead of computing the difference, it may be simpler to intersect the column names and use .select on each dataframe to keep only the common columns:

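    import org.apache.spark.sql.functions.{col, rand}

    val df = spark.range(10).withColumn("b", rand())   // columns: id, b
    val df2 = spark.range(10).withColumn("c", rand())  // columns: id, c
    // Columns present in both dataframes (a Set does not guarantee order)
    val commonCols = df.columns.toSet.intersect(df2.columns.toSet).toSeq.map(col)
    df.select(commonCols: _*)   // only "id" remains
    df2.select(commonCols: _*)  // only "id" remains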
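
If the order of the remaining columns matters, the intersect-and-select idea above could be wrapped in a small helper that keeps the first dataframe's column order. This is only a sketch: the object `CommonColumns` and the methods `commonColumnNames` and `keepCommon` are hypothetical names, not part of Spark, and the example assumes a Spark dependency on the classpath and a local master. Note that the name comparison here is case-sensitive, whereas Spark resolves column names case-insensitively by default.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, not part of the Spark API.
object CommonColumns {

  // Names present in both arrays, in the order they appear in `left`.
  def commonColumnNames(left: Array[String], right: Array[String]): Array[String] = {
    val rightSet = right.toSet
    left.filter(rightSet.contains)
  }

  // Project both dataframes onto their shared columns (same order in both results).
  def keepCommon(df1: DataFrame, df2: DataFrame): (DataFrame, DataFrame) = {
    val common = commonColumnNames(df1.columns, df2.columns).map(col)
    (df1.select(common: _*), df2.select(common: _*))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for this demo.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("common-columns")
      .getOrCreate()
    import spark.implicits._

    // Small stand-ins for the 53-column and 132-column dataframes.
    val df1 = Seq((1, "a", 0.5)).toDF("id", "name", "score")
    val df2 = Seq((1, true, "x")).toDF("id", "active", "name")

    val (out1, out2) = keepCommon(df1, df2)
    out1.printSchema() // schema contains id and name
    out2.printSchema() // schema contains id and name, in df1's order

    spark.stop()
  }
}
```

Using `select` with the common names makes the intent explicit and gives both results the same column order, which helps if the two dataframes are later combined with `union`, since `union` matches columns by position.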