I need to use the left-anti join to pull all the rows that do not match but, the problem is that the left-anti join is not flexible in terms of selecting columns, because it will only ever allow me select columns from the left dataframe... and I need to keep some columns from the right dataframe as well. So I tried:
cols_to_keep = ['col1_left_df', 'col2_left_df', 'col3_left_df',
'col1_right_df', 'col2_right_df', 'col3_right_df']
non_matches = (
left_df.join(right_df, [left_df.['col1_left_df'] == right_df.['col1_right_df']], how = 'lefouter')
.filter(col('col1_left_df').isNull()) & col('col1_right_df').isNull())
.select(cols_to_keep)
)
This allows me to choose columns from both left and right dataframes and did not return nay errors. However, due to the size and both - the known and unknown complexity of the actual data - I am still in the process of checking if it worked as intended or not (which is taking me ages).
My question: is there an alternative way of replicating the left-anti join which would let me select columns from both left and right dataframes?
Example inputs:
from pyspark.sql import functions as F
df_left = spark.createDataFrame(
[(1111, 4444),
(2222, 5555),
(None, 6666)],
['left_a', 'left_b'])
df_right = spark.createDataFrame(
[(1111, 7777),
(3333, 8888),
(None, 9999)],
['right_a', 'right_b'])
cols_to_keep = ['left_a', 'left_b', 'right_a', 'right_b']
If we do the left anti join, we get the following dataframe:
df_left.join(df_right, df_left.left_a == df_right.right_a, 'leftanti').show()
# +------+------+
# |left_a|left_b|
# +------+------+
# | null| 6666|
# | 2222| 5555|
# +------+------+
If we use your script, we get this:
df_non_matches = (
df_left
.join(df_right, df_left.left_a == df_right.right_a, how='leftouter')
.filter(F.col('left_a').isNull() & F.col('right_a').isNull())
.select(cols_to_keep)
)
df_non_matches.show()
# +------+------+-------+-------+
# |left_a|left_b|right_a|right_b|
# +------+------+-------+-------+
# | null| 6666| null| null|
# +------+------+-------+-------+
It can be seen that your algorithm implementing the full left outer join is not equivalent to the left anti join.
You can do it your way with the left outer join, but I think you don't need one of filter conditions:
df_non_matches = (
df_left
.join(df_right, df_left.left_a == df_right.right_a, how='leftouter')
.filter(F.isnull('right_a'))
.select(cols_to_keep)
)
df_non_matches.show()
# +------+------+-------+-------+
# |left_a|left_b|right_a|right_b|
# +------+------+-------+-------+
# | null| 6666| null| null|
# | 2222| 5555| null| null|
# +------+------+-------+-------+
Also, you could do the left anti join, but when it's finished just add missing null columns from the right dataframe:
df_non_matches = (
df_left
.join(df_right, df_left.left_a == df_right.right_a, 'leftanti')
.select(*[c if c in df_left.columns else F.lit(None).alias(c) for c in cols_to_keep])
)
df_non_matches.show()
# +------+------+-------+-------+
# |left_a|left_b|right_a|right_b|
# +------+------+-------+-------+
# | null| 6666| null| null|
# | 2222| 5555| null| null|
# +------+------+-------+-------+