I understand that \p{C} matches "invisible control characters and unused code points" (https://www.regular-expressions.info/unicode.html), and that \P{C} is its negation.
When I do this in a Databricks notebook, it works fine:
%sql
SELECT regexp_replace('abcd', '\\P{C}', 'x')
But the following fails (both %python and %scala):
%python
s = "SELECT regexp_replace('abcd', '\\P{C}', 'x')"
display(spark.sql(s))
java.util.regex.PatternSyntaxException: Illegal repetition near index 0
P{C}
^
The SQL command also works fine in Hive. I also tried escaping the curly braces as suggested here, but it didn't help.
Is there anything else I am missing? Thanks.
Spark SQL API:
Try adding 4 backslashes to escape 1 \. The pattern is un-escaped twice: the Python/Scala string literal turns \\\\ into \\, and then the Spark SQL parser turns \\ into \, which is what the regex engine finally receives.
spark.sql("SELECT regexp_replace('abcd', '\\\\P{C}', 'x')").show()
//+------------------------------+
//|regexp_replace(abcd, \P{C}, x)|
//+------------------------------+
//| xxxx|
//+------------------------------+
spark.sql("SELECT string('\\\\')").show()
//+-----------------+
//|CAST(\ AS STRING)|
//+-----------------+
//| \|
//+-----------------+
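This also shows why the original two-backslash version fails: the Python/Scala string literal and the Spark SQL parser each strip one backslash, so the regex engine ends up compiling the pattern P{C}, where the bare { opens an invalid repetition quantifier. A minimal sketch reproducing the same exception with plain Java regex (no Spark needed):
import java.util.regex.Pattern
Pattern.compile("P{C}")
// java.util.regex.PatternSyntaxException: Illegal repetition near index 0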
Alternatively, enable the escapedStringLiterals property to fall back to Spark 1.6 string-literal behavior:
spark.sql("set spark.sql.parser.escapedStringLiterals=true")
spark.sql("SELECT regexp_replace('abcd', '\\P{C}', 'x')").show()
//+------------------------------+
//|regexp_replace(abcd, \P{C}, x)|
//+------------------------------+
//| xxxx|
//+------------------------------+
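The same property can also be set through the runtime config API instead of a SQL SET statement; a small sketch, assuming an active SparkSession named spark:
spark.conf.set("spark.sql.parser.escapedStringLiterals", "true")
spark.sql("SELECT regexp_replace('abcd', '\\P{C}', 'x')").show()  // same xxxx output as above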
In the DataFrame API:
add 2 backslashes (\\) to escape 1 \. Here only the host-language string literal is un-escaped; no SQL parsing is involved, so \\ in source reaches the regex engine as a single \.
df.withColumn("dd",regexp_replace(lit("abcd"), "\\P{C}", "x")).show()
//+-----+----+
//|value| dd|
//+-----+----+
//| 1|xxxx|
//+-----+----+
df.withColumn("dd",lit("\\")).show()
//+-----+---+
//|value| dd|
//+-----+---+
//| 1| \|
//+-----+---+
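Finally, note that \P{C} (uppercase P) matches everything except category-C characters, which is why all four visible letters in abcd get replaced. If the actual goal is to strip the invisible control characters themselves, use lowercase \p{C}. A hedged end-to-end sketch, assuming a SparkSession named spark with spark.implicits._ in scope:
import org.apache.spark.sql.functions.regexp_replace
Seq("ab\u0007cd").toDF("value")
  .withColumn("cleaned", regexp_replace($"value", "\\p{C}", ""))
  .show()
// the "cleaned" column holds "abcd" -- the BEL control character is removed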