Regex lookaround does not work with quantifiers in SAS


I have a table similar to this:

Data have;
text = 'insurance premium'; output;
text = 'insur. premium'; output;
text = 'premium. insur aa'; output;
text = 'premium card'; output;
text = 'sales premium'; output;
Run;

My task is to select all transactions that contain the word premium, but do not contain the word insurance or a form thereof (e.g. insur, ins. etc.). I read up on how to use lookaround expressions in regex and wrote the following expression:

/(?<!ins[a-z.]*\s)premium(?!.*ins[a-z.]*\s)/i

The expression seems to work on testing websites such as https://regexr.com/, but when I run the code below I get an error in SAS:

Data want;
Set have;
re = prxparse('/(?<!ins[a-z.]*\s)premium(?!.*ins[a-z.]*\s)/i');
flg = prxmatch(re, text) > 0;
Run;

ERROR: Variable length lookbehind not implemented before HERE mark in regex m/(?<!insur[a-z.]*\s)premium(?!.*insur[a-z.]*\s) << 
       HERE /.
ERROR: Variable length lookbehind not implemented before HERE mark in regex m/(?<!ins[a-z.]*\s)premium(?!.*ins[a-z.]*\s) << HERE /.

ERROR: The regular expression passed to the function PRXPARSE contains a syntax error.
NOTE: Argument 1 to function PRXPARSE('/(?<!ins[a-z'[12 of 45 characters shown]) at line 30 column 6 is invalid.
NOTE: Argument 1 to the function PRXMATCH is missing.

As far as I understand, the problem is the * quantifiers inside the lookarounds, because the error does not occur if I remove them. Does SAS implement such expressions differently, or does it simply not support them?


Solution

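  • The error comes from the lookbehind, not the lookahead. SAS's PRX functions use Perl regular expression syntax, and the engine rejects a lookbehind whose length can vary; the * in (?<!ins[a-z.]*\s) makes it variable-length. Because flg = prxmatch(re, text) > 0; only tests whether there is a match at all (position > 0), you do not need the lookbehind.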

    You can put the negative lookahead at the start of the string to check for the variations of insurance, and then match the word premium.

    ^(?!.*\bins[a-z.]*\s).*\bpremium\b
    

    Explanation

    • ^ Start of string
    • (?! Negative lookahead: assert that what follows does not match
      • .*\bins Any characters, then a word starting with ins
      • [a-z.]*\s Zero or more of the characters a-z or ., followed by a whitespace character
    • ) Close the lookahead
    • .*\bpremium\b Match the whole word premium anywhere in the line
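
Plugged back into the data step from the question, the pattern above might be used as in the sketch below. It assumes the have dataset exactly as defined in the question; the retain / _n_ = 1 idiom and the drop statement are additions for illustration and are not part of the original code.

```sas
data want;
    set have;
    /* Assumption: compile the pattern once on the first row and keep the id */
    retain re;
    if _n_ = 1 then
        re = prxparse('/^(?!.*\bins[a-z.]*\s).*\bpremium\b/i');
    /* prxmatch returns the match position, or 0 when there is no match */
    flg = prxmatch(re, text) > 0;
    drop re;
run;
```

With the five sample rows, flg should be 0 for the three rows that mention an ins... word and 1 for 'premium card' and 'sales premium'.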
