Does regexp_extract work for multiple patterns? (Spark SQL)


Pattern 1: Delimited by |

Input:  a|b|c|d
Output: a|b|c|d

Pick everything when the string is delimited only by single pipes.

Pattern 2: Delimited by | and ||

Example 1:

Input:  a|b||c||d
Output: a|b||c

Pick everything before the last double pipe.

Example 2:

Input:  a|b||c|d
Output: a|b

Pattern 3: The beginning of the string can have multiple pipes (odd or even in number), and the rest is further delimited by | and ||

Input:  |||a|b||c||d
Output: |||a|b||c

Pick everything before the last double pipe. The beginning of the string might have an odd or even number of pipes, and they must be included in the result.

The query below covers every scenario except Pattern 1. My requirement is to cover all scenarios in one regexp_extract:

    spark.sql("select regexp_extract('name|place|thing|ink', '(.*)(?=\\\\|\\\\|)') as demo").show(false)

If it cannot be done in one regexp_extract, can you suggest other options?

Please advise.


Solution

  • Use the following RegEx:

    ^(\|*(?:(?!\|\|(?!.*\|\|)).)*)
    

    See the RegEx Demo showing all the matches

    This is a rather complicated requirement: it needs a Tempered Greedy Token with a Negative Lookahead inside the tempering pattern. The logic is explained below:

    Logic

    • ^ to match only from the beginning of string
    • (...) enclose the entire pattern after ^ to make it a capturing group
    • \|* for the requirement of Pattern 3 to match the multiple | at the beginning, as many as possible (hence use greedy *)
    • (?:(?!...).)* this is the main construct (skeleton) of Tempered Greedy Token whose details I will explain below:
    • \|\|(?!.*\|\|) this is the main body (core) of the Tempered Greedy Token. The first part, \|\|, identifies a double pipe ||, so that the match runs up to but does not include it. The second part, (?!.*\|\|), ensures that this || is not followed by another || anywhere later in the string, i.e. that it is the last double pipe, as per the requirement.
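
The breakdown above can be checked against the four inputs from the question. The following is a minimal sketch that uses Python's `re` module as a stand-in for Spark; Spark's regexp_extract runs on the Java regex engine, and the sketch assumes the lookahead constructs used here behave the same way in both engines for these inputs. The helper name `extract` is hypothetical and only mimics regexp_extract returning group 1, or an empty string when nothing matches.

```python
import re

# The pattern from the answer, written as a Python raw string.
# Assumption: Python's `re` stands in for the Java regex engine used by Spark.
PATTERN = re.compile(r"^(\|*(?:(?!\|\|(?!.*\|\|)).)*)")


def extract(text: str) -> str:
    """Return capture group 1, or "" when the pattern does not match."""
    match = PATTERN.search(text)
    return match.group(1) if match else ""


# Inputs and expected outputs copied from the question.
samples = {
    "a|b|c|d": "a|b|c|d",         # Pattern 1: single pipes only
    "a|b||c||d": "a|b||c",        # Pattern 2, example 1
    "a|b||c|d": "a|b",            # Pattern 2, example 2
    "|||a|b||c||d": "|||a|b||c",  # Pattern 3: leading pipes are kept
}

for text, expected in samples.items():
    result = extract(text)
    print(f"{text!r} -> {result!r} (expected {expected!r})")
```

Each input stops just before its last double pipe, and the Pattern 1 input is returned whole because it contains no double pipe for the tempering lookahead to stop at.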

    In fact, I think the question is quite interesting and requires sophisticated RegEx features to solve. This is also the first example I have seen so far that requires a Negative Lookahead within a Tempered Greedy Token construct.
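
To use the pattern inside a Spark SQL string literal, each backslash has to be doubled, which is what the four-backslash Scala literal in the question amounts to. The sketch below only builds the query text and does not run Spark. It assumes the default parser setting (`spark.sql.parser.escapedStringLiterals` left at false) and inputs without single quotes; the helper names `to_spark_sql_literal` and `build_query` are hypothetical.

```python
# The regex from the answer, as the regex engine should finally see it.
REGEX = r"^(\|*(?:(?!\|\|(?!.*\|\|)).)*)"


def to_spark_sql_literal(value: str) -> str:
    """Wrap a string in single quotes, doubling backslashes for the SQL parser.

    Assumption: default Spark SQL escaping rules; single quotes inside
    `value` are not handled in this sketch.
    """
    return "'" + value.replace("\\", "\\\\") + "'"


def build_query(text: str) -> str:
    """Build a regexp_extract query that returns capture group 1."""
    return (
        "select regexp_extract("
        f"{to_spark_sql_literal(text)}, {to_spark_sql_literal(REGEX)}, 1"
        ") as demo"
    )


query = build_query("a|b||c||d")
print(query)
```

With an existing SparkSession named `spark` (assumed here), the resulting string could be passed to `spark.sql(query).show(truncate=False)`.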