My question basically is: How can I tell my database to do a seq scan and STOP after the first match to my WHERE condition?
Assuming I want to find the first event of a certain type, I could write the following query:
select *
from installs
where country = 'China'
order by install_date
limit 1
The problem here is that, per SQL's logical order of operations, the engine scans the whole table to build the set of rows matching my filter, sorts that entire set (at immense cost), and only then returns the first row.
I could of course filter by specific dates, but let's assume I don't know which period to filter by - how can I optimize this type of query in Amazon Redshift (perhaps something in the WHERE clause)?
Redshift's general strategy is to do a lot of scanning, but parallelize it. Any case that involves getting a single row is not going to be ideal. That said, you can do four things:
1. Reduce scanning, to a point
If country will always be the field filtered on, set the table's sortkey to a compound sortkey with country as the leading column, so the scan can skip blocks that can't contain matching rows.
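As a sketch, assuming a simplified schema (only country and install_date appear in the question; the other column names are hypothetical):

```sql
-- Hypothetical simplified schema for illustration.
-- Sorting on country first lets Redshift use zone maps to skip
-- blocks whose country range can't match the filter.
CREATE TABLE installs (
    pk           BIGINT,
    country      VARCHAR(64),
    install_date DATE
)
COMPOUND SORTKEY (country, install_date);
```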
2. Eliminate the need for sorting
A more efficient way to do ORDER BY x LIMIT 1 is often MAX.
Then try
SELECT *
FROM installs
WHERE pk = (
    SELECT MAX(pk)  -- or MAX(install_date), if install_date is unique
    FROM installs
    WHERE country = 'China'
);
3. Tailor the selected columns between row oriented and columnar
Asking a columnar database like Redshift to SELECT * incurs a read cost for every column in the table. Select only the columns you actually need.
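Combining points 2 and 3: if the install date is all you actually need, the query collapses to a single aggregate over one column, with no join back to the table at all (assuming, as above, that country filters the rows of interest):

```sql
-- Reads only the country and install_date columns;
-- no sort, no second pass over the table.
SELECT MAX(install_date)
FROM installs
WHERE country = 'China';
```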
4. Add more nodes, so each node does less scanning
(Make sure the table's distribution style is not ALL - otherwise every node holds, and scans, a full copy of the data, and adding nodes gains you nothing.)
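You can check the current distribution style from the SVV_TABLE_INFO system view:

```sql
-- "table" is quoted because it is a reserved word.
SELECT "table", diststyle
FROM svv_table_info
WHERE "table" = 'installs';
```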