I am new to Python and Spark, spinning up on PySpark under Anaconda on Windows 10. I am going through the rlike section (#7) of this tutorial on the Where Filter method of the DataFrame class. There is nothing in the Python documentation about the supported regular expressions. I sometimes find the Java documentation a bit more informative than the PySpark documentation, but that didn't seem to be the case here.
The Python documentation linked to above says that "extended regex" is supported. I consider Wikipedia to contain the most generic (vendor/tool/platform-independent) information on this. However, it does not cover the case-sensitivity specifier used in section #7 of the above tutorial. Presumably, this is (?i).
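For example, I would expect something along these lines to work (the DataFrame here is just toy data I made up for probing the behavior):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Toy data, purely for probing rlike()'s behavior.
    df = spark.createDataFrame([("Alice",), ("ALICE",)], ["name"])

    # If (?i) is honored, both rows should come back.
    df.filter(df.name.rlike("(?i)alice")).show()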
What flavor of regular expression does Apache Spark SQL use for rlike?
According to this section of the official documentation for the SQL APIs, the flavor used is that of Java. I'm no expert on Spark (this is actually the first time I've read its docs), but it seems that this documentation applies to all of the Spark APIs involving rlike and friends.
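One quick way to convince yourself of this is to use a construct that java.util.regex accepts but that POSIX "extended regex" would reject, such as a possessive quantifier. A minimal sketch (the DataFrame is invented sample data):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Invented sample data, just for illustration.
    df = spark.createDataFrame([("a1",), ("b22",), ("c",)], ["code"])

    # \d and the possessive quantifier ++ are java.util.regex constructs;
    # a strict POSIX ERE engine would reject ++ as a syntax error.
    # Rows "a1" and "b22" match; "c" does not.
    df.filter(df.code.rlike(r"[a-z]\d++")).show()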
(?i) is an inline modifier that enables case-insensitive mode for everything that follows it. Similarly, to turn its effect off again, you can use (?-i); (?i)expression(?-i) can also be written as (?i:expression). These modifiers can be placed anywhere in the pattern, not just at the start, although it is good practice to put them at the start where possible. More information can be found in this answer.
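To make both forms concrete, here is a small PySpark sketch (the DataFrame is invented sample data):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Invented sample data.
    df = spark.createDataFrame([("James",), ("JAMES",), ("jim",)], ["name"])

    # (?i) up front: the whole pattern is case-insensitive,
    # so "James" and "JAMES" match but "jim" does not.
    df.filter(df.name.rlike("(?i)^james$")).show()

    # Scoped form (?i:...): only "jam" is case-insensitive; the trailing
    # "es" stays case-sensitive, so "James" matches but "JAMES" does not.
    df.filter(df.name.rlike("(?i:jam)es")).show()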