I have a table similar to this:
Data have;
text = 'insurance premium'; output;
text = 'insur. premium'; output;
text = 'premium. insur aa'; output;
text = 'premium card'; output;
text = 'sales premium'; output;
Run;
My task is to select all transactions that contain the word premium, but do not contain the word insurance or a form thereof (e.g. insur, ins. etc.). I read up on how to use lookaround expressions in regex and wrote the following expression:
/(?<!ins[a-z.]*\s)premium(?!.*ins[a-z.]*\s)/i
The expression seems to work on testing websites such as https://regexr.com/, but when I run the code below I get an error in SAS:
Data want;
Set have;
re = prxparse('/(?<!ins[a-z.]*\s)premium(?!.*ins[a-z.]*\s)/i');
flg = prxmatch(re, text) > 0;
Run;
ERROR: Variable length lookbehind not implemented before HERE mark in regex m/(?<!insur[a-z.]*\s)premium(?!.*insur[a-z.]*\s) <<
HERE /.
ERROR: Variable length lookbehind not implemented before HERE mark in regex m/(?<!ins[a-z.]*\s)premium(?!.*ins[a-z.]*\s) << HERE /.
ERROR: The regular expression passed to the function PRXPARSE contains a syntax error.
NOTE: Argument 1 to function PRXPARSE('/(?<!ins[a-z'[12 of 45 characters shown]) at line 30 column 6 is invalid.
NOTE: Argument 1 to the function PRXMATCH is missing.
As far as I understand, the issue is with the * quantifiers inside the lookaround assertions, because the error does not occur if I remove them. Does SAS implement such expressions differently, or does it simply not support them?
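The error is raised by the lookbehind, not by the lookahead. The SAS PRX functions use Perl regular expression syntax, where a lookbehind must have a fixed length, and ins[a-z.]*\s can match strings of different lengths. Removing the * makes the lookbehind fixed-length, which is why the error goes away.
With flg = prxmatch(re, text) > 0; you only check whether there is a match (position > 0), so it does not matter where in the string the match starts.
You can therefore put the negative lookahead at the start of the string to check for the variations of insurance, and then match the word premium: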
^(?!.*\bins[a-z.]*\s).*\bpremium\b
Explanation
^                  Start of string
(?!                Negative lookahead, assert that what is on the right is not
  .*\bins          Any characters, then a word starting with ins
  [a-z.]*\s        Optionally repeat matching chars a-z or . followed by a whitespace char
)                  Close the lookahead
.*\bpremium\b      Match the word premium in the line
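As a minimal sketch, the pattern could be dropped into the data step from the question like this. It assumes the have data set defined there. The / delimiters and the i flag are carried over from the question's original expression and are not part of the pattern shown above. Compiling the pattern only on the first observation is an optional refinement, not a requirement.

```sas
data want;
  set have;
  retain re;
  /* Compile the pattern once, on the first observation */
  if _n_ = 1 then
    re = prxparse('/^(?!.*\bins[a-z.]*\s).*\bpremium\b/i');
  /* prxmatch returns the position of the match, or 0 if there is none */
  flg = prxmatch(re, text) > 0;
  drop re;
run;
```

With the five sample rows, this should set flg to 1 only for 'premium card' and 'sales premium'. One caveat: the lookahead requires a whitespace character after the ins... word, so a value that ends with that word and fills the variable's full length (leaving no trailing blank) would not be excluded. If that can happen in your data, replacing \s with (?:\s|$) is one possible adjustment.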