regex apache-spark unicode apache-spark-sql regexp-replace

Using \P{C} in Spark SQL regexp_replace


I understand that \p{C} matches "invisible control characters and unused code points", so \P{C} (uppercase P) should match every other character: https://www.regular-expressions.info/unicode.html

When I do this (in a Databricks notebook), it works fine:

%sql
SELECT regexp_replace('abcd', '\\P{C}', 'x')

But the following fails (both %python and %scala):

%python 
s = "SELECT regexp_replace('abcd', '\\P{C}', 'x')"
display(spark.sql(s))

java.util.regex.PatternSyntaxException: Illegal repetition near index 0
P{C}
^

The SQL command also works fine in Hive. I also tried escaping the curly braces as suggested here, but that did not help.

Is there anything else I am missing? Thanks.


Solution

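  • Spark SQL API: use 4 backslashes in the host-language string to get 1 \ into the regex. The Scala/Python literal "\\\\" becomes \\, and the Spark SQL parser then unescapes \\ to a single \. With only 2 backslashes the parser receives \P, drops the backslash, and the regex engine sees P{C}, which is the pattern shown in the error.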

    spark.sql("SELECT regexp_replace('abcd', '\\\\P{C}', 'x')").show()
    //+------------------------------+
    //|regexp_replace(abcd, \P{C}, x)|
    //+------------------------------+
    //|                          xxxx|
    //+------------------------------+
    

    spark.sql("SELECT string('\\\\')").show()
    //+-----------------+
    //|CAST(\ AS STRING)|
    //+-----------------+
    //|                \|
    //+-----------------+
    

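    Alternatively, set spark.sql.parser.escapedStringLiterals=true to fall back to Spark 1.6 string-literal parsing, in which the SQL parser leaves backslashes in string literals untouched: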

    spark.sql("set spark.sql.parser.escapedStringLiterals=true")
    spark.sql("SELECT regexp_replace('abcd', '\\P{C}', 'x')").show()
    //+------------------------------+
    //|regexp_replace(abcd, \P{C}, x)|
    //+------------------------------+
    //|                          xxxx|
    //+------------------------------+
    

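  • DataFrame API: 2 backslashes (\\) are enough to produce 1 \, because the pattern does not pass through the SQL parser and is unescaped only once, by the host language: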

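    // Assumes df is an existing one-row DataFrame with a column named "value".
    import org.apache.spark.sql.functions.{lit, regexp_replace}
    df.withColumn("dd", regexp_replace(lit("abcd"), "\\P{C}", "x")).show()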
    //+-----+----+
    //|value|  dd|
    //+-----+----+
    //|    1|xxxx|
    //+-----+----+
    

    df.withColumn("dd",lit("\\")).show()
    //+-----+---+
    //|value| dd|
    //+-----+---+
    //|    1|  \|
    //+-----+---+
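
To see why the backslash count differs between the two APIs, it can help to trace a pattern through each unescaping layer without Spark at all. The sketch below is only an illustration: `sql_unescape` is a hypothetical, heavily simplified stand-in for the Spark SQL parser's default string-literal handling (it is not Spark code and ignores sequences such as \n or \uXXXX), and the variable names are made up for this example.

```python
import re


def sql_unescape(literal_body: str) -> str:
    """Hypothetical, simplified model of the SQL parser's default unescaping:
    a backslash followed by any character yields just that character.
    The real parser also handles special sequences; this sketch does not."""
    return re.sub(r"\\(.)", lambda m: m.group(1), literal_body)


# Layer 1: the Python literal is unescaped by Python itself.
two_backslashes = "\\P{C}"      # Python stores \P{C}
four_backslashes = "\\\\P{C}"   # Python stores \\P{C}
raw_literal = r"\\P{C}"         # a raw string also stores \\P{C}

# Layer 2 (spark.sql path only): the SQL parser unescapes again.
print(sql_unescape(two_backslashes))   # P{C}   -> "Illegal repetition"
print(sql_unescape(four_backslashes))  # \P{C}  -> valid regex

# DataFrame API path: no SQL parsing, so the regex engine receives
# the Python string as-is.
print(two_backslashes)                 # \P{C}  -> valid regex
```

Under this model, a Python raw string with two backslashes, r"\\P{C}", is equivalent to the four-backslash form, which may be easier to read inside a spark.sql call.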