I've recently started learning PySpark and I'm trying to understand how to return the position of a regex pattern within a column but then if the regex pattern does not exist in the column then to return a 0.
I've tried this other really good answer on StackOverflow which does a great job of returning the position of where a regex pattern starts in another string. https://stackoverflow.com/questions/66121409/locate-function-with-regular-expression-in-spark-sql
However, in this answer, when the regex pattern does not exist in the string it returns a 1 instead of what I need which is a 0.
Based on the other answer I've tried to adapt it like below:
spark.sql("select locate(regexp_extract('0000X00', '([X,Y,Z,W])', 1), '0000X00') as check_pos_string").show()
+----------------+
|check_pos_string|
+----------------+
| 5|
+----------------+
Which as expected returns the position of 5 which is where the character X from the regex pattern occurs in the string it is checking.
However, when I try the following it returns a 1 even though the pattern doesn't exist within the string it is checking.
spark.sql("select locate(regexp_extract('0000000', '([X,Y,Z,W])', 1), '0000000') as check_pos_string").show()
+----------------+
|check_pos_string|
+----------------+
| 1|
+----------------+
Instead of a 1, I would like to know how to return a 0 if the regex pattern doesn't exist in the string so it looks like below.
+----------------+
|check_pos_string|
+----------------+
| 0|
+----------------+
I have a feeling that I'm misunderstanding something simple due to my lack of experience with PySpark!
A couple of general points:
[X,Y,Z,W]
is a character class which matches all the characters between the square brackets, this means it matches X
, ,
Y
, ,
, Z
, ,
and W
(note it's matching the commas too). What you actually want is [XYZW]
or more concisely [W-Z]
.(...)
in your expression; without one just change the group
parameter to 0
.So your call to regexp_extract
should look like:
regexp_extract('0000X00', '[XYZW]', 0)
In terms of your problem, this is caused by regexp_extract
returning an empty string when it doesn't match anything, and locate
"finding" that string at the beginning of your input and hence returning 1
as the result.
You can work around that by instead replacing the match with a character that would not normally occur in the string, and then trying to locate
that character e.g.
spark.sql("select locate('*', regexp_replace('0000X0Y', '[W-Z]', '*')) as check_pos_string").show()
Output:
check_pos_string |
---|
5 |
When the match is not found, the special character will not be replaced into the string and so locate
will return 0
:
spark.sql("select locate('*', regexp_replace('0000000', '[W-Z]', '*')) as check_pos_string").show()
Output:
check_pos_string |
---|
0 |