I'm using PySpark 2.4.
I have the following dataframe as input:
+-------+-------+---------+
| ceci_p| ceci_l|ceci_stok|
+-------+-------+---------+
|SFIL401| BPI202|   BPI202|
| BPI202| CDC111|   BPI202|
| LBP347|SFIL402|  SFIL402|
| LBP347|SFIL402|   LBP347|
+-------+-------+---------+
I want to detect which ceci_stok values exist in both the ceci_l and ceci_p columns, using a join (maybe a self join). For example, ceci_stok = BPI202 exists in both ceci_l and ceci_p.

I want to create a new dataframe as a result that contains the ceci_stok values that exist in both ceci_l and ceci_p.
# create data for testing
data = [("SFIL401", "BPI202", "BPI202"),
        ("BPI202", "CDC111", "BPI202"),
        ("LBP347", "SFIL402", "SFIL402"),
        ("LBP347", "SFIL402", "LBP347")]
data_schema = ["ceci_p", "ceci_l", "ceci_stok"]
df = spark.createDataFrame(data=data, schema=data_schema)

# don't forget to cache a table you reference multiple times
df.cache()

# rename each column to a common key so the join lines up
ceci_p = df.select(df.ceci_p.alias("join_key")).distinct()
ceci_l = df.select(df.ceci_l.alias("join_key")).distinct()

# keep only the values that appear in both columns you're interested in
vals = ceci_l.join(ceci_p, "join_key").distinct()

df.join(vals, df.ceci_stok == vals.join_key).show()
+-------+-------+---------+--------+
| ceci_p| ceci_l|ceci_stok|join_key|
+-------+-------+---------+--------+
|SFIL401| BPI202| BPI202| BPI202|
| BPI202| CDC111| BPI202| BPI202|
+-------+-------+---------+--------+
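
As an alternative (a minimal sketch, not part of the original answer), you could get the overlap of the two columns with intersect and then use a left semi join, which filters df without adding the join_key column to the output:

# values present in both columns; intersect already returns distinct rows
both = (df.select(df.ceci_p.alias("join_key"))
          .intersect(df.select(df.ceci_l.alias("join_key"))))

# left_semi keeps only the matching rows of df, with df's original columns
df.join(both, df.ceci_stok == both.join_key, "left_semi").show()

This returns the same two BPI202 rows, just without the extra join_key column.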