Tags: dataframe, apache-spark, join, pyspark, multiple-columns

Left joining but keeping all the columns from the dataframe on the right


If I have two dataframes:

df_1 =

my_id    col_1    col_2    col_3
10001        1        2        3
10002        6        1        3
10003        7        2        7
10004        9        3        4
10005        1        2        8
10006        1        9        5

df_2 =

my_new_id    col_4    col_5    col_final
10005        1        2               10
10006        6        1               10

And I want:

df_output =
    
    my_id    col_1    col_2      col_3     col_final
    10005         1        2        8             10
    10006         1        9        5             10

I would do a left join with the smaller df_2 on the left. But do I have to list by hand every column I want in the output (in reality df_1 has dozens of columns)?

df_output = df_2.join(df_1.select("col_1", "col_2", "col_3"), df_2["my_new_id"] == df_1["my_id"], how="left").drop('my_new_id')

Also, my code above does not include col_final. How could I solve this correctly?


Solution

  • You can use the asterisk with df_1 — df_1['*'] expands to all of df_1's columns, so you don't have to list them, and you can still append col_final from df_2:

    df_output = (df_2
        .join(df_1, df_2.my_new_id == df_1.my_id, 'left')
        .select(df_1['*'], 'col_final')
    )