I have been trying to join two dataframes using the following list of join key passed as a list and I want to add the functionality to join on a subset of the keys if one of the key value is null
I have been trying to join two dataframes df_1 and df_2.
data1 = [[1,'2018-07-31',215,'a'],
[2,'2018-07-30',None,'b'],
[3,'2017-10-28',201,'c']
]
df_1 = sqlCtx.createDataFrame(data1,
['application_number','application_dt','account_id','var1'])
and
data2 = [[1,'2018-07-31',215,'aaa'],
[2,'2018-07-30',None,'bbb'],
[3,'2017-10-28',201,'ccc']
]
df_2 = sqlCtx.createDataFrame(data2,
['application_number','application_dt','account_id','var2'])
The code I use to join is this:
key_a = ['application_number','application_dt','account_id']
new = df_1.join(df_2,key_a,'left')
The output for the same is:
+------------------+--------------+----------+----+----+
|application_number|application_dt|account_id|var1|var2|
+------------------+--------------+----------+----+----+
| 1| 2018-07-31| 215| a| aaa|
| 3| 2017-10-28| 201| c| ccc|
| 2| 2018-07-30| null| b|null|
+------------------+--------------+----------+----+----+
My concern here is, in the case where account_id is null, the join should still work by comparing other 2 keys.
The required output should be like this:
+------------------+--------------+----------+----+----+
|application_number|application_dt|account_id|var1|var2|
+------------------+--------------+----------+----+----+
| 1| 2018-07-31| 215| a| aaa|
| 3| 2017-10-28| 201| c| ccc|
| 2| 2018-07-30| null| b| bbb|
+------------------+--------------+----------+----+----+
I have found a similar approach to do so by using the statement:
join_elem = "df_1.application_number ==
df_2.application_number|df_1.application_dt ==
df_2.application_dt|F.coalesce(df_1.account_id,F.lit(0)) ==
F.coalesce(df_2.account_id,F.lit(0))".split("|")
join_elem_column = [eval(x) for x in join_elem]
But the design consideration do not allow me to use a full join expression and i am stuck with using the list of column names as join-key.
I have been trying to find a way to accommodate this coalesce thing into this list itself but have not found any success so far.
I would call this solution a workaround.
The issue here is that we have Null
value for one of the keys in the DataFrame
and the OP wants that rest of the key columns to be used instead. Why not assign an arbitrary value to this Null
and then apply the join. Effectively this would be same thing like making a join on the remaining two keys.
# Let's replace Null with an arbitrary value, which has
# little chance of occurring in the Dataset. For eg; -100000
df1 = df1.withColumn('account_id', when(col('account_id').isNull(),-100000).otherwise(col('account_id')))
df2 = df2.withColumn('account_id', when(col('account_id').isNull(),-100000).otherwise(col('account_id')))
# Do a FULL Join
df = df1.join(df2,['application_number','application_dt','account_id'],'full')
# Replace the arbitrary value back with Null.
df = df.withColumn('account_id', when(col('account_id')== -100000, None).otherwise(col('account_id')))
df.show()
+------------------+--------------+----------+----+----+
|application_number|application_dt|account_id|var1|var2|
+------------------+--------------+----------+----+----+
| 1| 2018-07-31| 215| a| aaa|
| 2| 2018-07-30| null| b| bbb|
| 3| 2017-10-28| 201| c| ccc|
+------------------+--------------+----------+----+----+