For PySpark, I find back-ticks enclosing regular expressions for DataFrame.colRegex() here, here, and in this SO question. Here is the example from the DataFrame.colRegex doc string:
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"])
df.select(df.colRegex("`(Col1)?+.+`")).show()
+----+
|Col2|
+----+
|   1|
|   2|
|   3|
+----+
The answer to the SO question doesn't show back-ticks for Scala. It refers to the Java documentation for the Pattern class, but that doesn't explain back-ticks. This page indicates that back-ticks in Python yield the string representation of the enclosed expression, but that doesn't apply to a regular expression.

What is the explanation for the back-ticks?
The back-ticks are used to delimit the column name in case it includes special characters. For example, if you had a column called column-1 and you tried

SELECT column-1 FROM mytable

you would probably get a "non-existent column 'column'" error, because the interpreter treats the query as SELECT (column) - 1 FROM mytable. Instead, you can delimit the column name with back-ticks to get around that issue:

SELECT `column-1` FROM mytable
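To make the parsing issue concrete, here is a minimal sketch using Python's built-in sqlite3 module (SQLite also honors MySQL-style back-ticked identifiers); the table mytable and its column-1 column are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with a column literally named column-1
conn.execute("CREATE TABLE mytable (`column-1` INTEGER)")
conn.execute("INSERT INTO mytable VALUES (41)")

# Without back-ticks the identifier is parsed as (column) - 1,
# so the engine looks for a column named "column" and fails:
try:
    conn.execute("SELECT column-1 FROM mytable").fetchall()
except sqlite3.OperationalError as e:
    print(e)  # no such column: column

# Back-ticks delimit the whole name, so the query succeeds:
print(conn.execute("SELECT `column-1` FROM mytable").fetchall())  # [(41,)]
```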
Note that the PySpark doc string for DataFrame.colRegex() describes the argument as "column name specified as a regex": the pattern must match an entire column name, not merely a portion of one. If only substring matches were intended, back-ticks would be confusing, because back-ticks are never used to delimit part of a column name. SQL's convention of adorning whole column names with back-ticks therefore carries over naturally to colRegex, whose argument matches an entire column name rather than a substring.
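As a rough sketch of that whole-name semantics (assuming colRegex behaves like Java's Matcher.matches, i.e. the pattern must cover the full column name), the selection can be emulated with Python's re.fullmatch; the col_regex helper and the column names below are made up for illustration:

```python
import re

# Hypothetical column names for illustration
columns = ["Col1", "Col2", "Extra"]

def col_regex(pattern, names):
    # Keep only the names that the pattern matches *in full*,
    # mirroring the assumed whole-name semantics of colRegex.
    return [n for n in names if re.fullmatch(pattern, n)]

print(col_regex("Col.", columns))  # ['Col1', 'Col2']
print(col_regex("ol", columns))    # [] -- a substring hit is not enough
```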