Tags: apache-spark, pyspark, regexp-replace

regexp_replace on PySpark used on two columns


I'd like to perform an atypical regexp_replace in PySpark based on two columns: one attribute holds the address and another holds the city, and I would like to use the city value to delete it from the address when it is present. I have written a function to do this:

from pyspark.sql.functions import col, regexp_replace

df = spark.createDataFrame(
    [
        (1, 'hügelstrasse 34, ansbach', 'ansbach'),
        (2, 'panton st. 2, london', 'london')
    ],
    ('id', 'address', 'city')
)

def dropCityAddress(street, city):
    new = regexp_replace(street, city, '')
    return new

df.withColumn('newaddress', dropCityAddress(col('address'), col('city')))

but this fails because the city Column object is not iterable (regexp_replace expects the pattern as a plain string here). The desired output would be a new column with the city removed from the address (I am not interested in commas or other leftovers, just in deleting the city). I will perform this task on a large dataset, so a solution based on something like a collect action would not suit this problem.

Is there a way to perform this task?


Solution

  • Check the code below. Wrapping the call in expr evaluates regexp_replace as a SQL expression, in which the city column can be referenced as the pattern; the Python function API here only accepts a string literal for the pattern:

    from pyspark.sql.functions import expr

    df.withColumn("newaddress", expr("regexp_replace(address, city, '')")).show(truncate=False)
    
    +---+------------------------+-------+-----------------+
    |id |address                 |city   |newaddress       |
    +---+------------------------+-------+-----------------+
    |1  |hügelstrasse 34, ansbach|ansbach|hügelstrasse 34, |
    |2  |panton st. 2, london    |london |panton st. 2,    |
    +---+------------------------+-------+-----------------+
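
  • If you are on Spark 3.4 or later, regexp_replace itself accepts Column arguments for the pattern and replacement, so the function-style attempt from the question works directly, without expr. A minimal sketch, assuming Spark >= 3.4:

    from pyspark.sql.functions import col, regexp_replace

    # Since Spark 3.4, the pattern may be a Column, so the city
    # column can be passed directly. Note that in both variants the
    # city value is interpreted as a regular expression, so city names
    # containing regex metacharacters would need escaping first.
    df.withColumn('newaddress', regexp_replace(col('address'), col('city'), '')).show(truncate=False)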