My question basically is: How can I tell my database to do a seq scan and STOP after the first match to my WHERE condition?
Assuming I want to find the first event of a certain type, I could write the following query:
select *
from installs
where country = 'China'
order by install_date
limit 1
The problem here is that, per SQL's logical order of operations, the engine scans the whole table to build the set of rows matching my filter, sorts that entire set (at immense cost), and only then returns the first row.
I could of course filter by specific dates, but let's assume I don't know which period to filter by - how can I optimize this type of query in Amazon Redshift (perhaps something in the WHERE clause)?
Redshift's general strategy is to do a lot of scanning, but parallelize it. Any case that involves getting a single row is not going to be ideal. That said, you can do four things:
1. Reduce scanning, to a point
If country will always be the field filtered on, set the table's sortkey to a compound sortkey with country as the leading column, so the scan can skip blocks that can't contain matching rows.
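As a sketch, assuming a simplified schema (only country and install_date appear in the question; the other column names are hypothetical):

```sql
-- Hypothetical simplified schema for illustration.
-- Sorting on country first lets Redshift use zone maps to skip
-- blocks whose country range can't match the filter.
CREATE TABLE installs (
    pk           BIGINT,
    country      VARCHAR(64),
    install_date DATE
)
COMPOUND SORTKEY (country, install_date);
```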
2. Eliminate the need for sorting
A more efficient way to do ORDER BY x LIMIT 1 is often MAX.
Then try
SELECT *
FROM installs
WHERE pk = (
    SELECT MAX(pk)  -- or MAX(install_date), if install_date is unique
    FROM installs
    WHERE country = 'China'
);
3. Tailor the selected columns between row oriented and columnar
Asking a columnar database like Redshift to SELECT * incurs a read cost for every column in the table. Select only the columns you actually need.
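Combining points 2 and 3: if the install date is all you actually need, the query collapses to a single aggregate over one column, with no join back to the table at all (assuming, as above, that country filters the rows of interest):

```sql
-- Reads only the country and install_date columns;
-- no sort, no second pass over the table.
SELECT MAX(install_date)
FROM installs
WHERE country = 'China';
```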
4. Add more nodes, so each node does less scanning
(Make sure the table's distribution style is not ALL - otherwise every node holds, and scans, a full copy of the data, and adding nodes gains you nothing.)
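You can check the current distribution style from the SVV_TABLE_INFO system view:

```sql
-- "table" is quoted because it is a reserved word.
SELECT "table", diststyle
FROM svv_table_info
WHERE "table" = 'installs';
```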