Tags: html, apache-spark, pyspark, apache-spark-sql

Find and replace HTML-encoded characters in a PySpark dataframe column


I have a dataframe created by reading from a Parquet file. A couple of string-type columns contain HTML encodings like &amp;amp; &amp;gt; &amp;quot; etc.

I need to find and replace these with the ASCII characters they represent, i.e.:

  • &amp;amp; with &
  • &amp;gt; with >
  • &amp;quot; with "

I have used regexp_replace("colA", "&amp;amp;", "&"), which worked for one instance, until we discovered the column contains many more such encodings. Ideally, I would like to define a dictionary and then use it with regexp_replace against each column where I need to handle these HTML encodings.

I have searched the internet but couldn't find any examples of using regexp_replace with a dictionary.

a) Is it even possible to use a dictionary with regexp_replace? b) Is there any other solution, other than writing a separate expression for each encoding?

I am using Apache Spark 3.0.1 and PySpark on Databricks Runtime 7.3.


Solution

  • In my opinion, it's better to use a Python library that handles all HTML-encoded characters than to define your own dictionary. You can use Python's built-in html module for that:

    import html

    from pyspark.sql.functions import col, udf

    df = spark.createDataFrame(spark.sparkContext.parallelize([
        ["me &amp;amp; you"],
    ]), ['text'])

    def unescape_html(value):
        # Only unescape strings; nulls and other types pass through unchanged.
        return html.unescape(value) if isinstance(value, str) else value

    unescape_html_udf = udf(unescape_html)
    df.withColumn("text", unescape_html_udf(col("text"))).show()
    
    +--------+
    |    text|
    +--------+
    |me & you|
    +--------+
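    If row-at-a-time Python UDFs become a bottleneck on larger data, a vectorized pandas UDF (available since Spark 2.3, so usable on Spark 3.0.1) is a sketch of the same idea that processes a whole batch per call — assuming pandas and PyArrow are available on the cluster:

    ```python
    import html

    import pandas as pd
    from pyspark.sql.functions import col, pandas_udf

    @pandas_udf("string")
    def unescape_html_vec(s: pd.Series) -> pd.Series:
        # Unescape each value in the batch; nulls and non-strings pass through.
        return s.map(lambda v: html.unescape(v) if isinstance(v, str) else v)

    df.withColumn("text", unescape_html_vec(col("text"))).show()
    ```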
    

    If you want to apply this to all columns, you can loop over df.columns and call the UDF on each one the same way:

    df = spark.createDataFrame(spark.sparkContext.parallelize([
        [1, "me &amp;amp; you"],
        [2, "&amp;gt; &amp;quot;"],
    ]), ['id', 'text'])

    def unescape_html(value):
        # Only unescape strings; nulls and other types pass through unchanged.
        return html.unescape(value) if isinstance(value, str) else value

    unescape_html_udf = udf(unescape_html)
    for column in df.columns:
        df = df.withColumn(column, unescape_html_udf(col(column)))
    df.show()
    
    +---+--------+
    | id|    text|
    +---+--------+
    |  1|me & you|
    |  2|     > "|
    +---+--------+
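
    To answer question (a) directly: regexp_replace takes no dictionary argument, but you can fold a dictionary into a chain of regexp_replace calls with functools.reduce. The entity dictionary below is illustrative and only covers the entities mentioned in the question — which is exactly why html.unescape is the more robust choice:

    ```python
    import re
    from functools import reduce

    from pyspark.sql.functions import col, regexp_replace

    # Illustrative dictionary; extend with whatever entities your data contains.
    entities = {"&amp;": "&", "&gt;": ">", "&lt;": "<", "&quot;": '"'}

    def unescape_entities(column_name):
        # Nest one regexp_replace per dictionary entry around the column.
        return reduce(
            lambda c, kv: regexp_replace(c, re.escape(kv[0]), kv[1]),
            entities.items(),
            col(column_name),
        )

    for column in ["text"]:  # restrict to string columns
        df = df.withColumn(column, unescape_entities(column))
    df.show()
    ```

    Note that this stays entirely in Spark SQL expressions (no Python UDF overhead), but it silently misses any entity not listed in the dictionary.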