Tags: html, apache-spark, pyspark, apache-spark-sql

Find and replace HTML-encoded characters in a PySpark dataframe column


I have a dataframe created by reading from a Parquet file. A couple of string-type columns contain HTML encodings like &amp;amp; &amp;gt; &amp;quot; etc.

I need to find and replace these with the ASCII characters they represent, i.e.:

  • &amp;amp; with &
  • &amp;gt; with >
  • &amp;quot; with "

I have used regexp_replace("colA", "&amp;amp;", "&"), which worked for one instance, until we discovered the column contains many more such encodings. Ideally, I would like to define a dictionary and then use it with regexp_replace against each column where I need to handle these HTML encodings.

I have searched the internet but couldn't find any examples of using regexp_replace with a dictionary.

a) Is it even possible to use a dictionary with regexp_replace? b) Is there any other solution, other than writing a separate expression for each encoding?

I am using Apache Spark 3.0.1 and PySpark on Databricks Runtime 7.3.


Solution

  • In my opinion, it's better to use a Python library that handles all HTML-encoded characters than to define your own dictionary. You can use Python's built-in html module for that:

    import html

    from pyspark.sql.functions import col, udf

    df = spark.createDataFrame(spark.sparkContext.parallelize([
        ["me &amp;amp; you"],
    ]), ['text'])

    def unescape_html(value):
        # Only unescape strings; nulls and other types pass through unchanged.
        return html.unescape(value) if isinstance(value, str) else value

    unescape_html_udf = udf(unescape_html)
    df.withColumn("text", unescape_html_udf(col("text"))).show()
    
    +--------+
    |    text|
    +--------+
    |me & you|
    +--------+
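    If row-at-a-time Python UDFs become a bottleneck on larger data, a vectorized pandas UDF (available since Spark 2.3, so usable on Spark 3.0.1) is a sketch of the same idea that processes a whole batch per call — assuming pandas and PyArrow are available on the cluster:

    ```python
    import html

    import pandas as pd
    from pyspark.sql.functions import col, pandas_udf

    @pandas_udf("string")
    def unescape_html_vec(s: pd.Series) -> pd.Series:
        # Unescape each value in the batch; nulls and non-strings pass through.
        return s.map(lambda v: html.unescape(v) if isinstance(v, str) else v)

    df.withColumn("text", unescape_html_vec(col("text"))).show()
    ```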
    

    If you want to apply this to all columns, you can loop over df.columns and call the UDF on each one the same way:

    df = spark.createDataFrame(spark.sparkContext.parallelize([
        [1, "me &amp;amp; you"],
        [2, "&amp;gt; &amp;quot;"],
    ]), ['id', 'text'])

    def unescape_html(value):
        # Only unescape strings; nulls and other types pass through unchanged.
        return html.unescape(value) if isinstance(value, str) else value

    unescape_html_udf = udf(unescape_html)
    for column in df.columns:
        df = df.withColumn(column, unescape_html_udf(col(column)))
    df.show()
    
    +---+--------+
    | id|    text|
    +---+--------+
    |  1|me & you|
    |  2|     > "|
    +---+--------+
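
    To answer question (a) directly: regexp_replace takes no dictionary argument, but you can fold a dictionary into a chain of regexp_replace calls with functools.reduce. The entity dictionary below is illustrative and only covers the entities mentioned in the question — which is exactly why html.unescape is the more robust choice:

    ```python
    import re
    from functools import reduce

    from pyspark.sql.functions import col, regexp_replace

    # Illustrative dictionary; extend with whatever entities your data contains.
    entities = {"&amp;": "&", "&gt;": ">", "&lt;": "<", "&quot;": '"'}

    def unescape_entities(column_name):
        # Nest one regexp_replace per dictionary entry around the column.
        return reduce(
            lambda c, kv: regexp_replace(c, re.escape(kv[0]), kv[1]),
            entities.items(),
            col(column_name),
        )

    for column in ["text"]:  # restrict to string columns
        df = df.withColumn(column, unescape_entities(column))
    df.show()
    ```

    Note that this stays entirely in Spark SQL expressions (no Python UDF overhead), but it silently misses any entity not listed in the dictionary.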