Search code examples
regexposixamazon-redshift

Redshift POSIX regex order does not matter


I'm querying data from AWS Redshift by using the POSIX regular expressions. However, I have difficulties to find the whole string by finding multiple words without considering the order.

The table is like this:

ID  | full_term 
123 | juice apple farm
123 | apple juice original
123 | banana juice

For example, I'm looking for a whole string that contains both apple and juice, so I expect to get the first two rows. My current query:

SELECT full_term FROM data_table
WHERE full_term ~ '(.*apple+)(.*juice+).*$'

However, the order does matter in this method. I also tried full_term ~ '(?=.*apple+)(?=.*juice+).*$' but I got an error message [Amazon](500310) Invalid operation: Invalid preceding regular expression prior to repetition operator. The error occurred while parsing the regular expression fragment: '(?>>>HERE>>>=.*apple+)'. I just realized ?= does not work in Redshift.

Is using the UDF the only solution in this case? Also, I only want exact apple and juice in full terms. That is to say, pineapple should not be included.


Solution

  • This is probably most clearly written as ANDed separate regular expression matches. To ensure that you don't match e.g. pineapple when looking for apple, you need to check that on either side of the search term is either a space character or the beginning/end of the line:

    SELECT full_term FROM data_table
    WHERE full_term ~ '(^|\\s)apple(\\s|$)'
      AND full_term ~ '(^|\\s)juice(\\s|$)'