Search code examples
pythonpyspark

Join two dataframes to get data into another dataframe and create a new column in pyspark


I have a pyspark dataframe

mail_id         e_no
[email protected]     111
[email protected]     222
[email protected]     333

I have another dataframe

email           emp_no
[email protected]     111
                222
                333

I need to fill the emp_no from second dataframe using first dataframe. I need to do a left join to get all records from first dataframe. I also need an extra col to indicate from which dataframe I'm getting the emp_no.

The final dataframe should be like:

email           emp_no  src
[email protected]     111     tbl1
[email protected]     222     tbl2
[email protected]     333     tbl2
final_df = df1.join(df2, df1['e_no'] == df2['emp_no'], 'left')
final_df = final_df.withColumn('src', F.when('mail_id'.isNull() | mail_id =='', 'tbl2')
                                        .when('mail_id' != '', 'tbl1')
                                        .otherwise(F.lit('')))

I'm not getting the right results. Please suggest if any changes are needed.


Solution

  • You can do the following:

    # Performing left join
    result = df1.join(df2, df1.mail_id == df2.email, "left")
    
    # Creating the 'src' column
    result = result.withColumn("src", when(col("emp_no").isNull(), "tbl1").otherwise("tbl2"))
    
    # Selecting and renaming columns to match the final desired format
    final_result = result.select(
        col("mail_id").alias("email"),
        when(col("emp_no").isNull(), col("e_no")).otherwise(col("emp_no")).alias("emp_no"),
        "src"
    )
    
    final_result.show()