I have a PySpark dataframe:
mail_id e_no
[email protected] 111
[email protected] 222
[email protected] 333
I have another dataframe:
email             emp_no
[email protected]   111
(blank)           222
(blank)           333
I need to fill in the emp_no from the second dataframe using the first dataframe. I need to do a left join so that all records from the first dataframe are kept, and I also need an extra column indicating which dataframe each emp_no comes from.
The final dataframe should look like this:
email emp_no src
[email protected] 111 tbl1
[email protected] 222 tbl2
[email protected] 333 tbl2
final_df = df1.join(df2, df1['e_no'] == df2['emp_no'], 'left')
final_df = final_df.withColumn('src', F.when('mail_id'.isNull() | mail_id =='', 'tbl2')
.when('mail_id' != '', 'tbl1')
.otherwise(F.lit('')))
I'm not getting the right results. Please suggest if any changes are needed.
You can do the following:
from pyspark.sql.functions import coalesce, col, when

# Performing the left join on the email columns, keeping every row of df1
result = df1.join(df2, df1.mail_id == df2.email, "left")

# Creating the 'src' column: rows with no email match in df2 are labelled
# 'tbl2' and matched rows 'tbl1', which is the labelling in your desired
# output (and in your own when() attempt)
result = result.withColumn("src", when(col("emp_no").isNull(), "tbl2").otherwise("tbl1"))

# Selecting and renaming columns to match the final desired format;
# coalesce() takes emp_no where the join matched and falls back to e_no
final_result = result.select(
    col("mail_id").alias("email"),
    coalesce(col("emp_no"), col("e_no")).alias("emp_no"),
    "src",
)
final_result.show()
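If you want to sanity-check the expected output without spinning up a Spark session, here is a plain-Python sketch of the same left-join-with-fallback logic. The email strings are made-up placeholders, since the real addresses are redacted in the question:

```python
# Sample rows mirroring df1 (mail_id, e_no) and df2 (email, emp_no).
# The email values are placeholders standing in for the redacted addresses.
df1_rows = [("a@x.com", 111), ("b@x.com", 222), ("c@x.com", 333)]
df2_rows = [("a@x.com", 111), ("", 222), ("", 333)]

# Index df2 by email for the left-join lookup; blank emails never match
df2_by_email = {email: emp_no for email, emp_no in df2_rows if email}

final = []
for mail_id, e_no in df1_rows:
    emp_no = df2_by_email.get(mail_id)          # None when no match in df2
    if emp_no is None:
        final.append((mail_id, e_no, "tbl2"))   # fall back to df1's e_no
    else:
        final.append((mail_id, emp_no, "tbl1"))

# final == [("a@x.com", 111, "tbl1"), ("b@x.com", 222, "tbl2"), ("c@x.com", 333, "tbl2")]
```

This mirrors what the join + coalesce + when() combination does row by row, so you can compare it against the src labels you expect before running the Spark version.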