Using \P{C} in Spark SQL regexp_replace

I understand that \p{C} represents "invisible control characters and unused code points", and that \P{C} is its negation, matching any character outside that category: https://www.regular-expressions.info/unicode.html

When I run this in a Databricks notebook, it works fine:

%sql
SELECT regexp_replace('abcd', '\\P{C}', 'x')

But the following fails (both %python and %scala):

%python 
s = "SELECT regexp_replace('abcd', '\\P{C}', 'x')"
display(spark.sql(s))

java.util.regex.PatternSyntaxException: Illegal repetition near index 0
P{C}
^

The SQL command also works fine in Hive. I also tried escaping the curly braces as suggested here, but that did not help.

Is there anything else I am missing? Thanks.

Solution

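In the Spark SQL API, write four backslashes in the Scala or Python string literal to get one \ in the regex. The host language reduces \\\\ to \\, and the Spark SQL parser then reduces \\ to \, so the regex engine receives \P{C}. With only two backslashes, the parser receives \P and drops the backslash, leaving P{C}, which is the pattern shown in the error message.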

spark.sql("SELECT regexp_replace('abcd', '\\\\P{C}', 'x')").show()
//+------------------------------+
//|regexp_replace(abcd, \P{C}, x)|
//+------------------------------+
//|                          xxxx|
//+------------------------------+

spark.sql("SELECT string('\\\\')").show()
//+-----------------+
//|CAST(\ AS STRING)|
//+-----------------+
//|                \|
//+-----------------+
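
The backslash counting above can be checked without a cluster. The following is a rough sketch in plain Python: sql_unescape is a hypothetical helper, not a Spark API, and it only imitates the single parser rule that matters here (a backslash is consumed and the next character is kept). It also shows that a Python raw string with two backslashes produces the same query text as a normal string with four.

```python
# Simplified, assumed model of Spark SQL's default string-literal parsing.
# Only one rule is modelled: a backslash is dropped and the character after
# it is kept. Named escapes (newline, tab), octal and Unicode escapes, and
# the LIKE-style percent/underscore cases are deliberately NOT modelled.
def sql_unescape(body: str) -> str:
    out = []
    i = 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body):
            out.append(body[i + 1])  # drop the backslash, keep the next char
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


# What the SQL parser sees after Python has processed each literal.
two = "\\P{C}"     # 2 backslashes in source -> 1 backslash in the SQL text
four = "\\\\P{C}"  # 4 backslashes in source -> 2 backslashes in the SQL text
raw = r"\\P{C}"    # raw string: both backslashes are kept as written

for label, text in [("two", two), ("four", four), ("raw", raw)]:
    # Second column: SQL literal body; third column: pattern passed to the regex engine.
    print(label, text, sql_unescape(text))
```

Under this model the two-backslash version hands P{C} to the regex engine, while the four-backslash and raw-string versions both hand it \P{C}.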

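Alternatively, enable the spark.sql.parser.escapedStringLiterals property to fall back to Spark 1.6 string-literal parsing, in which the SQL parser no longer unescapes backslashes in string literals: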

spark.sql("set spark.sql.parser.escapedStringLiterals=true")
spark.sql("SELECT regexp_replace('abcd', '\\P{C}', 'x')").show()
//+------------------------------+
//|regexp_replace(abcd, \P{C}, x)|
//+------------------------------+
//|                          xxxx|
//+------------------------------+
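
Because this setting changes how every later query in the session is parsed, it may be safer to enable it only around the queries that need it. Below is a minimal PySpark-style sketch of that idea. The helper name escaped_string_literals is hypothetical, and spark is assumed to be an existing SparkSession such as the one a Databricks notebook provides; only spark.conf.get and spark.conf.set are used.

```python
from contextlib import contextmanager

KEY = "spark.sql.parser.escapedStringLiterals"


@contextmanager
def escaped_string_literals(spark):
    """Temporarily enable Spark 1.6 style string-literal parsing.

    Hypothetical helper: `spark` is assumed to be a SparkSession.
    """
    previous = spark.conf.get(KEY, "false")  # remember the current value
    spark.conf.set(KEY, "true")
    try:
        yield spark
    finally:
        spark.conf.set(KEY, previous)  # restore it even if the query fails


# Intended usage (needs a live session, so it is left commented out):
# with escaped_string_literals(spark) as s:
#     s.sql("SELECT regexp_replace('abcd', '\\P{C}', 'x')").show()
```

Inside the with block, two backslashes in the Python literal are enough, since the SQL parser leaves the remaining single backslash alone.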

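In the DataFrame API, the pattern does not pass through the SQL parser, so two backslashes (\\) in the Scala string literal are enough to produce one \ in the regex: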

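import org.apache.spark.sql.functions.{lit, regexp_replace}

// df is assumed to be an existing one-row DataFrame, e.g. Seq(1).toDF("value")
df.withColumn("dd", regexp_replace(lit("abcd"), "\\P{C}", "x")).show()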
//+-----+----+
//|value|  dd|
//+-----+----+
//|    1|xxxx|
//+-----+----+

df.withColumn("dd",lit("\\")).show()
//+-----+---+
//|value| dd|
//+-----+---+
//|    1|  \|
//+-----+---+