Tags: apache-spark, join, pyspark, apache-spark-sql, self-join

Detect existence of column element in multiple other columns using join


I'm using PySpark 2.4.

I have a dataframe like the one below as input:

+-------+-------+---------+
| ceci_p| ceci_l|ceci_stok|
+-------+-------+---------+
|SFIL401| BPI202|   BPI202|
| BPI202| CDC111|   BPI202|
| LBP347|SFIL402|  SFIL402|
| LBP347|SFIL402|   LBP347|
+-------+-------+---------+

I want to detect which ceci_stok values exist in both the ceci_l and ceci_p columns, using a join (maybe a self-join).

For example: ceci_stok = BPI202 exists in both ceci_l and ceci_p.

As a result, I want to create a new dataframe containing the rows whose ceci_stok value exists in both ceci_l and ceci_p.


Solution

    # create data for testing
    # (assumes a running SparkSession bound to `spark`, as in the pyspark shell)
    data = [("SFIL401","BPI202","BPI202"),
            ("BPI202","CDC111","BPI202"),
            ("LBP347","SFIL402","SFIL402"),
            ("LBP347","SFIL402","LBP347")]

    data_schema = ["ceci_p", "ceci_l", "ceci_stok"]

    df = spark.createDataFrame(data=data, schema=data_schema)
    df.cache()  # don't forget to cache a table you reference multiple times

    # give both columns the same name so the two sides can be joined
    ceci_p = df.select(df.ceci_p.alias("join_key")).distinct()
    ceci_l = df.select(df.ceci_l.alias("join_key")).distinct()

    # the inner join keeps only the values present in both columns
    vals = ceci_l.join(ceci_p, "join_key").distinct()

    df.join(vals, df.ceci_stok == vals.join_key).show()
    +-------+-------+---------+--------+
    | ceci_p| ceci_l|ceci_stok|join_key|
    +-------+-------+---------+--------+
    |SFIL401| BPI202|   BPI202|  BPI202|
    | BPI202| CDC111|   BPI202|  BPI202|
    +-------+-------+---------+--------+
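
If you only need the original columns (without the extra join_key), a left semi join performs the same filtering while keeping only df's columns. This is a minimal sketch reusing the df and vals dataframes defined above:

    # left semi join: keep the rows of df whose ceci_stok matches a value
    # in vals, without pulling vals' columns into the result
    df.join(vals, df.ceci_stok == vals.join_key, "leftsemi").show()

This returns the same two BPI202 rows, just without the duplicated join_key column.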