Search code examples
pythonregexpyspark

PySpark how to get the position of a list of strings within a column but return a zero if it doesn't exist


I've recently started learning PySpark and I'm trying to understand how to return the position of a regex pattern within a column but then if the regex pattern does not exist in the column then to return a 0.

I've tried this other really good answer on StackOverflow which does a great job of returning the position of where a regex pattern starts in another string. https://stackoverflow.com/questions/66121409/locate-function-with-regular-expression-in-spark-sql However, in this answer, when the regex pattern does not exist in the string it returns a 1 instead of what I need which is a 0.

Based on the other answer I've tried to adapt it like below:

spark.sql("select locate(regexp_extract('0000X00', '([X,Y,Z,W])', 1),  '0000X00') as check_pos_string").show()

+----------------+
|check_pos_string|
+----------------+
|               5|
+----------------+

Which as expected returns the position of 5 which is where the character X from the regex pattern occurs in the string it is checking.

However, when I try the following it returns a 1 even though the pattern doesn't exist within the string it is checking.

spark.sql("select locate(regexp_extract('0000000', '([X,Y,Z,W])', 1),  '0000000') as check_pos_string").show()

+----------------+
|check_pos_string|
+----------------+
|               1|
+----------------+

Instead of a 1, I would like to know how to return a 0 if the regex pattern doesn't exist in the string so it looks like below.

+----------------+
|check_pos_string|
+----------------+
|               0|
+----------------+

I have a feeling that I'm misunderstanding something simple due to my lack of experience with PySpark!


Solution

  • A couple of general points:

    • [X,Y,Z,W] is a character class which matches all the characters between the square brackets, this means it matches X, , Y, ,, Z, , and W (note it's matching the commas too). What you actually want is [XYZW] or more concisely [W-Z].
    • You don't need a capture group (...) in your expression; without one just change the group parameter to 0.

    So your call to regexp_extract should look like:

    regexp_extract('0000X00', '[XYZW]', 0)
    

    In terms of your problem, this is caused by regexp_extract returning an empty string when it doesn't match anything, and locate "finding" that string at the beginning of your input and hence returning 1 as the result.

    You can work around that by instead replacing the match with a character that would not normally occur in the string, and then trying to locate that character e.g.

    spark.sql("select locate('*', regexp_replace('0000X0Y', '[W-Z]', '*')) as check_pos_string").show()
    

    Output:

    check_pos_string
    5

    When the match is not found, the special character will not be replaced into the string and so locate will return 0:

    spark.sql("select locate('*', regexp_replace('0000000', '[W-Z]', '*')) as check_pos_string").show()
    

    Output:

    check_pos_string
    0