Search code examples
dataframeapache-sparkjoinpysparkregexp-replace

How to join two data frames using regexp_replace


I want to join two dataframes by removing the matching records using column cust_id. Some of the cust_id have leading zeros. So I need to match by removing zeros in the 'ON' clause. Tried the below query, but it's giving error in Databricks notebook.

PS: I don't want to create another DF1 with zeros removed.

enter image description here

Query:

df1 = df1.join(df2,[regexp_replace("cust_id", r'^[0]*','')], "leftanti")

Py4j.Py4JException: Method and Class java.lang.string does not exist


Solution

  • The following works, but the output that you provided will not be reached using "leftanti" join: S15 matches S15 from another table, so it is removed too. In the example that you provided, "leftanti" join does not return any row.

    from pyspark.sql import functions as F
    df1 = spark.createDataFrame(
        [(1, 'S15', 'AAA'),
         (2, '00767', 'BBB'),
         (3, '03246', 'CCC')],
        ['ID', 'cust_id', 'Name'])
    df2 = spark.createDataFrame(
        [(1, 'S15', 'AAA'),
         (2, '767', 'BBB'),
         (3, '3246', 'CCC')],
        ['ID', 'cust_id', 'Name'])
    df = df1.join(df2, df2.cust_id == F.regexp_replace(df1.cust_id, r'^0*', ''), "leftanti")
    
    df.show()
    # +---+-------+----+
    # | ID|cust_id|Name|
    # +---+-------+----+
    # +---+-------+----+