I tried to replace part of the column text with "?". But I got an error. I'd like to use the native dataframe in spark.
TypeError: Column is not iterable
data = [("John Doe", "123 Main St, Anytown, USA"),
("Jane Smith", "456 High St, Somewhere, USA")]
df = spark.createDataFrame(data, ["name", "address"])
from pyspark.sql.functions import regexp_replace, col
start = 3
end = 10
df_masked = df.withColumn("address_masked", regexp_replace(col("address"), substring("address", start, end), "??????????"))
expected result
name | address_masked |
---|---|
John Doe | 123????????, Anytown, USA |
Jane Smith | 456??????????, Somewhere, USA |
Try with expr()
to iterate through the column using substring
function
Example:
df_masked = df.withColumn("address_masked", expr(f'regexp_replace(address, substring(address, {start}, {end}), "??????????")'))
#name address address_masked
#John Doe 123 Main St, Anytown, USA 12?????????? Anytown, USA
#Jane Smith 456 High St, Somewhere, USA 45?????????? Somewhere, USA