I'm querying data from AWS Redshift by using the POSIX regular expressions. However, I have difficulties to find the whole string by finding multiple words without considering the order.
The table is like this:
ID | full_term
123 | juice apple farm
123 | apple juice original
123 | banana juice
For example, I'm looking for a whole string that contains both apple
and juice
, so I expect to get the first two rows. My current query:
SELECT full_term FROM data_table
WHERE full_term ~ '(.*apple+)(.*juice+).*$'
However, the order does matter in this method. I also tried full_term ~ '(?=.*apple+)(?=.*juice+).*$'
but I got an error message [Amazon](500310) Invalid operation: Invalid preceding regular expression prior to repetition operator. The error occurred while parsing the regular expression fragment: '(?>>>HERE>>>=.*apple+)'.
I just realized ?=
does not work in Redshift.
Is using the UDF the only solution in this case?
Also, I only want exact apple
and juice
in full terms. That is to say, pineapple
should not be included.
This is probably most clearly written as AND
ed separate regular expression matches. To ensure that you don't match e.g. pineapple
when looking for apple
, you need to check that on either side of the search term is either a space character or the beginning/end of the line:
SELECT full_term FROM data_table
WHERE full_term ~ '(^|\\s)apple(\\s|$)'
AND full_term ~ '(^|\\s)juice(\\s|$)'