I have a dataframe created by reading from a parquet file. There are a couple of string-type columns that contain HTML encodings like &amp;amp; &amp;gt; &amp;quot;
I need to find and replace these with their representative ASCII characters, i.e.:
&amp;amp;
with &
&amp;gt;
with >
&amp;quot;
with "
I have used regexp_replace("colA", "&amp;amp;", "&")
which worked for that one encoding, until we discovered the column contains many more such encodings. Ideally, I would like to create a dictionary and then use it with regexp_replace against each column where I need to handle these HTML encodings.
I have searched the internet, but couldn't find any examples of using regexp_replace with a dictionary.
a) Is it even possible to use a dictionary with regexp_replace? b) Is there any other solution, other than writing a separate expression for each encoding?
I am using Apache Spark 3.0.1 and PySpark on Databricks runtime 7.3
In my opinion, it's better to use a Python library that handles all HTML-encoded characters than to define your own dictionary. You can use the built-in html module for that:
import html
from pyspark.sql.functions import col, udf

df = spark.createDataFrame(spark.sparkContext.parallelize([
    ["me &amp;amp; you"],
]), ['text'])

def unescape_html(value):
    # Decode HTML character references; leave non-string values (e.g. nulls) untouched
    return html.unescape(value) if isinstance(value, str) else value

unescape_html_udf = udf(unescape_html)

df.withColumn("text", unescape_html_udf(col("text"))).show()
+--------+
| text|
+--------+
|me & you|
+--------+
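For reference, html.unescape resolves the full set of named and numeric HTML character references, so there is no dictionary to maintain. A minimal plain-Python sketch of what the UDF does per value (no Spark needed):

```python
import html

# html.unescape decodes named (&amp;) and numeric (&#38;) character references
samples = ["me &amp; you", "&gt;", "&quot;hi&quot;", "&#38;"]
print([html.unescape(s) for s in samples])  # ['me & you', '>', '"hi"', '&']
```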
If you want to apply this to all columns, you can simply loop over all columns and call the UDF the same way:
df = spark.createDataFrame(spark.sparkContext.parallelize([
    [1, "me &amp;amp; you"],
    [2, "&amp;gt; &amp;quot;"],
]), ['id', 'text'])

def unescape_html(value):
    # Decode HTML character references; leave non-string values (e.g. ids, nulls) untouched
    return html.unescape(value) if isinstance(value, str) else value

unescape_html_udf = udf(unescape_html)

for column in df.columns:
    df = df.withColumn(column, unescape_html_udf(col(column)))

df.show()
df.show()
+---+--------+
| id| text|
+---+--------+
| 1|me & you|
| 2| > "|
+---+--------+
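One caveat with looping over df.columns: the UDF also runs over non-string columns (like id above), which casts them to strings since udf defaults to a string return type. If you want to keep their original types, you can filter on df.dtypes first. A small sketch, where string_columns is a hypothetical helper and the (name, type) pairs mimic what df.dtypes returns:

```python
# df.dtypes returns (column name, Spark type name) pairs,
# e.g. [("id", "bigint"), ("text", "string")] for the dataframe above
def string_columns(dtypes):
    """Return only the names of string-typed columns (hypothetical helper)."""
    return [name for name, dtype in dtypes if dtype == "string"]

print(string_columns([("id", "bigint"), ("text", "string")]))  # ['text']
```

You would then loop with for column in string_columns(df.dtypes) instead of for column in df.columns.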