Using \P{C} in Spark SQL regexp_replace

I understand that \p{C} represents "invisible control characters and unused code points", so \P{C} (capital P) matches every character outside that category.

When I run this in a Databricks notebook, it works fine:

SELECT regexp_replace('abcd', '\\P{C}', 'x')

But the following fails when the same query is built as a string (in both %python and %scala):

s = "SELECT regexp_replace('abcd', '\\P{C}', 'x')"

java.util.regex.PatternSyntaxException: Illegal repetition near index 0

The SQL command also works fine in Hive. I also tried escaping the curly braces as suggested here, but that did not help.

Is there anything else I am missing? Thanks.


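  • Spark SQL API: use 4 backslashes to get 1 \ into the regex. The Python/Scala string literal turns \\\\ into \\, and the Spark SQL parser then unescapes \\ into \. A pure SQL cell skips the first step, which is why \\ is enough there.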

    spark.sql("SELECT regexp_replace('abcd', '\\\\P{C}', 'x')").show()
    //|regexp_replace(abcd, \P{C}, x)|
    //|                          xxxx|

    spark.sql("SELECT string('\\\\')").show()
    //|CAST(\ AS STRING)|
    //|                \|


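    Alternatively, enable the spark.sql.parser.escapedStringLiterals property to fall back to Spark 1.6 string-literal parsing, in which the SQL parser leaves backslashes as they are: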

    spark.sql("set spark.sql.parser.escapedStringLiterals=true")
    spark.sql("SELECT regexp_replace('abcd', '\\P{C}', 'x')").show()
    //|regexp_replace(abcd, \P{C}, x)|
    //|                          xxxx|

    In the DataFrame API: use 2 backslashes \\ to get 1 \, because the pattern is not passed through the SQL parser:

    df.withColumn("dd", regexp_replace(lit("abcd"), "\\P{C}", "x")).show()
    //|value|  dd|
    //|    1|xxxx|

    //|value| dd|
    //|    1|  \|
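
To make the backslash counting above easier to follow, here is a minimal Python sketch of the unescaping layers. It does not call Spark: sql_unescape and pattern_seen_by_regex are hypothetical helpers, and the unescaping rule is a simplified assumption (a backslash followed by any character yields that character), not Spark's actual parser.

```python
# Simplified model of the unescaping layers; NOT Spark's real parser.

def sql_unescape(literal_body: str) -> str:
    """Hypothetical model of SQL string-literal unescaping.

    Assumption: a backslash followed by any character yields that
    character. Special sequences such as \\n or \\t are ignored here.
    """
    out = []
    i = 0
    while i < len(literal_body):
        ch = literal_body[i]
        if ch == "\\" and i + 1 < len(literal_body):
            out.append(literal_body[i + 1])  # drop the backslash, keep the next char
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def pattern_seen_by_regex(literal_body: str, escaped_string_literals: bool = False) -> str:
    """Return the pattern text that would reach the regex engine.

    escaped_string_literals=True models the Spark 1.6 fallback,
    where the SQL layer leaves backslashes untouched.
    """
    return literal_body if escaped_string_literals else sql_unescape(literal_body)


# Layer 1: the Python literal already turns each "\\" into one backslash.
two = "\\P{C}"      # the SQL text contains \P{C}
four = "\\\\P{C}"   # the SQL text contains \\P{C}

# Layer 2: the (modelled) SQL parser unescapes once more.
print(pattern_seen_by_regex(two))    # prints P{C}
print(pattern_seen_by_regex(four))   # prints \P{C}
print(pattern_seen_by_regex(two, escaped_string_literals=True))  # prints \P{C}
```

With only 2 backslashes the regex engine ends up with P{C}, where { directly follows a plain P, which would explain the "Illegal repetition" error in the question. The DataFrame API case corresponds to layer 1 alone, so 2 backslashes are enough there.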