I'd like to perform an atypical regexp_replace in PySpark based on two columns: one attribute holds the address and another one the city, and I would like to use the city column to delete the city from the address whenever it is present. I have written a function to do this:
from pyspark.sql.functions import col, regexp_replace

df = spark.createDataFrame(
    [
        (1, 'hügelstrasse 34, ansbach', 'ansbach'),
        (2, 'panton st. 2, london', 'london')
    ],
    ('id', 'address', 'city')
)

def dropCityAddress(street, city):
    return regexp_replace(street, city, '')

df.withColumn('newaddress', dropCityAddress(col('address'), col('city')))
but this fails because the city Column object is not iterable. The desired output is a new column with the city removed from the address (I am not interested in trailing commas or other leftovers, just in deleting the city). I will run this on a large dataset, so a solution based on something like a collect action is not suited for this problem.
Is there a way to perform this task?
Check the code below. Wrapping the call in expr lets regexp_replace take the city column as its pattern. (Note that in PySpark the call is show(truncate=False); show(false) is the Scala form.)
df.withColumn("newaddress", expr("regexp_replace(address, city, '')")).show(truncate=False)
+---+------------------------+-------+-----------------+
|id |address |city |newaddress |
+---+------------------------+-------+-----------------+
|1 |hügelstrasse 34, ansbach|ansbach|hügelstrasse 34, |
|2 |panton st. 2, london |london |panton st. 2, |
+---+------------------------+-------+-----------------+