
Back-ticks in DataFrame.colRegex?


For PySpark, I find back-ticks enclosing the regular expressions passed to DataFrame.colRegex() here, here, and in this SO question. Here is the example from the DataFrame.colRegex doc string:

df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"])
df.select(df.colRegex("`(Col1)?+.+`")).show()
+----+
|Col2|
+----+
|   1|
|   2|
|   3|
+----+

The answer to the SO question doesn't show back-ticks for Scala; it refers to the Java documentation for the Pattern class, which doesn't explain back-ticks either.

This page describes back-ticks in Python as producing the string representation of the enclosed variable, but that usage doesn't apply to a regular expression.

What is the explanation for the back-ticks?


Solution

  • The back-ticks delimit the column name in case it contains special characters. For example, if you had a column called column-1 and tried

    SELECT column-1 FROM mytable
    

    you would probably get a

    non-existent column 'column'

    error, because the parser treats the query as SELECT (column) - 1 FROM mytable. Delimiting the column name with back-ticks gets around that issue:

    SELECT `column-1` FROM mytable
    

    Note that the PySpark doc string for DataFrame.colRegex() describes the argument as a "column name specified as a regex" — that is, a pattern standing in for an entire column name, not a regular expression that merely matches portions of one or more column names. If it matched substrings, the back-ticks would be confusing, since back-ticks are never used to delimit part of a column name. Carrying SQL's convention of enclosing whole column names in back-ticks over to colRegex therefore makes sense: the regex plays the role of a complete (quoted) column name rather than a substring.
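To illustrate the "entire column name" point, here is a minimal pure-Python sketch of that behaviour. col_regex_select is a hypothetical helper, not Spark's actual implementation: it strips the enclosing back-ticks and then requires the pattern to match a whole column name.

```python
import re

def col_regex_select(columns, pattern):
    """Hypothetical helper (NOT Spark's real parser): strip the enclosing
    back-ticks, then keep only the columns whose ENTIRE name matches the
    pattern, mirroring how colRegex treats its argument as a quoted
    column name rather than a substring pattern."""
    if pattern.startswith("`") and pattern.endswith("`"):
        pattern = pattern[1:-1]
    return [c for c in columns if re.fullmatch(pattern, c)]

# A column literally named "column-1" is matched as a whole name:
print(col_regex_select(["column-1", "column1"], "`column-1`"))  # ['column-1']
```

Because re.fullmatch is used, the back-ticked pattern selects only columns whose complete name matches — a pattern matching just a fragment of a name selects nothing.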