If I have two dataframes:
df_1 =
my_id col_1 col_2 col_3
10001 1 2 3
10002 6 1 3
10003 7 2 7
10004 9 3 4
10005 1 2 8
10006 1 9 5
df_2 =
my_new_id col_4 col_5 col_final
10005 1 2 10
10006 6 1 10
And I want:
df_otput =
my_id col_1 col_2 col_3 col_final
10005 1 2 8 10
10006 1 9 5 10
I would do a left join with the smaller df_2
on the left. But do I have to manually write all the columns that I want in the output (in reality df_1
has dozens of columns)?
df_output = df_2.join(df_1.select("col_1", "col_2", "col_3"), df_2["my_new_id"] == df_1["my_id"], how="left").drop('my_new_id')
Also, my code above does not include col_final
. How could I solve this correctly?
You can use the asterisk with df_1
:
df_output = (df_2
.join(df_1, df_2.my_new_id == df_1.my_id, 'left')
.select(df_1['*'], 'col_final')
)