database postgresql amazon-web-services amazon-redshift postgresql-performance

Optimize large IN condition for Redshift query

I have a ~2TB fully vacuumed Redshift table with a distkey phash (high cardinality, hundreds of millions of values) and compound sortkeys (phash, last_seen).

When I do a query like:

SELECT
    DISTINCT ret_field
FROM
    table
WHERE
    phash IN (
        '5c8615fa967576019f846b55f11b6e41',
        '8719c8caa9740bec10f914fc2434ccfd',
        '9b657c9f6bf7c5bbd04b5baf94e61dae'
    )
AND
    last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'

It returns very quickly. However when I increase the number of hashes beyond 10, Redshift converts the IN condition from a bunch of ORs to an array, per http://docs.aws.amazon.com/redshift/latest/dg/r_in_condition.html#r_in_condition-optimization-for-large-in-lists

The problem is when I have a couple dozen phash values, the "optimized" query goes from less than a second response time to over half an hour. In other words it stops using the sortkey and does a full table scan.

Any idea how I can prevent this behavior and retain the use of sortkeys to keep the query quick?

Here is the EXPLAIN difference between <10 hashes and >10 hashes:

Less than 10 (0.4 seconds):

XN Unique  (cost=0.00..157253450.20 rows=43 width=27)
    ->  XN Seq Scan on table  (cost=0.00..157253393.92 rows=22510 width=27)
                Filter: ((((phash)::text = '394e9a527f93377912cbdcf6789787f1'::text) OR ((phash)::text = '4534f9f8f68cc937f66b50760790c795'::text) OR ((phash)::text = '5c8615fa967576019f846b55f11b6e61'::text) OR ((phash)::text = '5d5743a86b5ff3d60b133c6475e7dce0'::text) OR ((phash)::text = '8719c8caa9740bec10f914fc2434cced'::text) OR ((phash)::text = '9b657c9f6bf7c5bbd04b5baf94e61d9e'::text) OR ((phash)::text = 'd7337d324be519abf6dbfd3612aad0c0'::text) OR ((phash)::text = 'ea43b04ac2f84710dd1f775efcd5ab40'::text)) AND (last_seen >= '2015-10-01 00:00:00'::timestamp without time zone) AND (last_seen <= '2015-10-31 23:59:59'::timestamp without time zone))

More than 10 (45-60 minutes):

XN Unique  (cost=0.00..181985241.25 rows=1717530 width=27)
    ->  XN Seq Scan on table  (cost=0.00..179718164.48 rows=906830708 width=27)
                Filter: ((last_seen >= '2015-10-01 00:00:00'::timestamp without time zone) AND (last_seen <= '2015-10-31 23:59:59'::timestamp without time zone) AND ((phash)::text = ANY ('{33b84c5775b6862df965a0e00478840e,394e9a527f93377912cbdcf6789787f1,3d27b96948b6905ffae503d48d75f3d1,4534f9f8f68cc937f66b50760790c795,5a63cd6686f7c7ed07a614e245da60c2,5c8615fa967576019f846b55f11b6e61,5d5743a86b5ff3d60b133c6475e7dce0,8719c8caa9740bec10f914fc2434cced,9b657c9f6bf7c5bbd04b5baf94e61d9e,d7337d324be519abf6dbfd3612aad0c0,dbf4c743832c72e9c8c3cc3b17bfae5f,ea43b04ac2f84710dd1f775efcd5ab40,fb4b83121cad6d23e6da6c7b14d2724c}'::text[])))

Solution

It's worth a try to set sortkeys (last_seen, phash), putting last_seen first.

The reason of slowness might be because the leading column for the sort key is phash which looks like a random character. As AWS redshift dev docs says, the timestamp columns should be as the leading column for the sort key if using that for where conditions.

If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key. - Choose the Best Sort Key - Amazon Redshift

With this order of the sort key, all columns will be sorted by last_seen, then phash. (What does it mean to have multiple sortkey columns?)

One note is that you have to recreate your table to change the sort key. This will help you to do that.