regex, apache-spark, pyspark

What flavor of regular expression does Apache Spark SQL use for "rlike"?


I am new to Python and Spark, spinning up on PySpark under Anaconda on Windows 10. I am working through the rlike section #7 of this tutorial on the Where Filter method of the DataFrame class. There is nothing in the Python documentation about the supported regular-expression syntax. I sometimes find the Java documentation a bit more informative than the PySpark documentation, but that didn't seem to be the case here.

The Python documentation linked above says that "extended regex" is supported. I consider Wikipedia to have the most generic (vendor/tool/platform-independent) information on this. However, it does not cover the case-sensitivity specifier used in section #7 of the above tutorial, which is presumably (?i).
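For concreteness, here is a minimal sketch (with a made-up DataFrame, not code from the tutorial) of the kind of case-insensitive rlike filter in question:

```python
# Minimal sketch: a hypothetical DataFrame with a "name" column,
# filtered with rlike and the presumed (?i) case-insensitivity flag.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rlike-demo").getOrCreate()
df = spark.createDataFrame([("Alice",), ("BOB",), ("carol",)], ["name"])

# Case-insensitive match: keeps "BOB" even though the pattern is lowercase.
df.where(df.name.rlike("(?i)^bob$")).show()
```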

What flavor of regular expression does Apache Spark SQL use for rlike?


Solution

  • According to this section in the official documentation for SQL APIs, the flavor used is that of Java. I'm no expert on Spark (this is actually the first time I have read its docs), but it seems that this documentation applies to all of the Spark APIs related to rlike and friends.

    (?i) is an inline modifier that enables case-insensitive mode for all expressions following it. Similarly, to turn its effect off, you can use (?-i). (?i)expression(?-i) can also be written as (?i:expression). These modifiers can be placed anywhere in the pattern, not just at the start, although it is good practice to place them at the start where possible. More information can be found in this answer.
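    For illustration, a short PySpark sketch of the three forms described above (the data and column name are made up; the patterns use Java regex syntax as interpreted by rlike):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("FooBar",), ("foobar",), ("FOOBAR",)], ["s"])

# (?i) at the start: the whole pattern is case-insensitive -> matches all 3 rows.
df.where(df.s.rlike("(?i)^foobar$")).show()

# (?i)...(?-i): case-insensitive for "foo" only, then switched off,
# so "Bar" must match exactly -> matches "FooBar" only.
df.where(df.s.rlike("^(?i)foo(?-i)Bar$")).show()

# (?i:...) scopes the modifier to the group -> same result as above.
df.where(df.s.rlike("^(?i:foo)Bar$")).show()
```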