Tags: dataframe, apache-spark, pyspark, apache-spark-sql

Join two data frames, select all columns from one and some columns from the other


Let's say I have a Spark DataFrame df1 with several columns (among them a column id), and a DataFrame df2 with two columns, id and other.

Is there a way to replicate the following command:

sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id")

by using only pyspark functions such as join(), select() and the like?

I have to implement this join in a function and I don't want to be forced to have sqlContext as a function parameter.
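
For reference, a minimal self-contained setup along these lines reproduces the situation (the extra columns and sample values are invented purely for illustration, and it assumes a modern SparkSession rather than the older sqlContext entry point):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# df1: several columns, among them 'id'; df2: only 'id' and 'other'
df1 = spark.createDataFrame([(1, "alice", 10), (2, "bob", 20)], ["id", "name", "value"])
df2 = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "other"])

# Registering temp views makes the SQL version above runnable as well
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
spark.sql("SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id").show()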


Solution

  • Not sure if it's the most efficient way, but this worked for me:

    from pyspark.sql.functions import col

    # Alias both frames so their columns can be referenced unambiguously,
    # then keep every column of df1 plus selected columns from df2
    df1.alias('a').join(df2.alias('b'), col('b.id') == col('a.id')) \
        .select([col('a.' + c) for c in df1.columns] + [col('b.other1'), col('b.other2')])
    

    The trick is in:

    [col('a.' + c) for c in df1.columns]: all columns of df1 (alias a)

    [col('b.other1'), col('b.other2')]: some columns of df2 (alias b)
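
    As a sketch (not part of the original answer), here is an alternative that avoids string aliases entirely, assuming the join key is called id in both frames: select everything from df1 with df1['*'] and pick the extra column from df2 directly.

    # Keep all of df1's columns plus df2.other; this is unambiguous because
    # the duplicated df2.id column is never selected
    result = df1.join(df2, df1.id == df2.id).select(df1['*'], df2.other)
    result.show()

    Passing the join condition as a Column expression keeps each frame's lineage intact, which is what lets df1['*'] and df2.other resolve without aliases.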