Tags: python, apache-spark, pyspark, apache-spark-sql, regexp-replace

Match a group of stop words and replace with empty string in PySpark


I have an address column in a table, as shown below:

 >>> spark.sql("select add1 from test").show(10,False)
+---------------------------+
|add1                       |
+---------------------------+
|PO BOX 1111DUBAI  UAE      |
|P.O. BOX 2222DUBAI - U.A.E |
+---------------------------+

I have to match a group of words and remove them from the column, just for comparison purposes.

I was able to replace a single word with an empty string, as below ('UAE' in this case):

spark.sql("select regexp_replace(add1,'UAE','') from test").show(10,False)
+---------------------------+
|regexp_replace(add1, UAE, )|
+---------------------------+
|PO BOX 1111DUBAI           |
|P.O. BOX 2222DUBAI - U.A.E |
+---------------------------+

But I would like to remove all special characters and a group of stop words like {'UAE', 'U.A.E', 'U A E', 'PO', 'P O'}.

Desired output:

+---------------------------+
|add1                       |
+---------------------------+
|1111DUBAI                  |
|2222DUBAI                  |
+---------------------------+

Can someone help me achieve this, please? Thanks!


Solution

  • Here is how you can do it; hopefully it solves all your cases.

    from pyspark.sql import functions as F

    # stop words to strip; regexp_replace treats each one as a regex, so the
    # '.' in 'U.A.E' matches any character (harmless here because all special
    # characters are removed before the stop words are applied)
    stop_words = {'UAE', 'U.A.E', 'U A E', 'PO', 'P O', 'BOX'}

    # first drop every character that is not alphanumeric or whitespace
    df2 = df.withColumn(
        "add1",
        F.regexp_replace(F.col("add1"), r"[^a-zA-Z0-9\s]", "")
    )

    # then remove each stop word and trim the leftover whitespace
    for stop_word in stop_words:
        df2 = df2.withColumn(
            "add1",
            F.trim(F.regexp_replace(F.col("add1"), stop_word, ""))
        )

    df2.show()
    
    +---------+
    |     add1|
    +---------+
    |1111DUBAI|
    |2222DUBAI|
    +---------+
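
  • Alternatively, the per-word loop can be collapsed into a single pass by joining the stop words into one alternation pattern and calling regexp_replace once. This is a minimal sketch under the same assumptions (a DataFrame df with an add1 column); re.escape keeps metacharacters such as the '.' in 'U.A.E' from acting as regex wildcards. For the two sample rows it yields the same output as the loop above.

    import re

    from pyspark.sql import functions as F

    stop_words = {'UAE', 'U.A.E', 'U A E', 'PO', 'P O', 'BOX'}

    # escape each word so it is matched literally; longer words go first
    # because regex alternation tries branches left to right
    pattern = "|".join(re.escape(w) for w in sorted(stop_words, key=len, reverse=True))

    df2 = df.withColumn(
        "add1",
        F.trim(
            F.regexp_replace(
                # strip special characters first, as in the loop version
                F.regexp_replace(F.col("add1"), r"[^a-zA-Z0-9\s]", ""),
                pattern,
                ""
            )
        )
    )

    df2.show()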