amazon-redshift, amazon-athena, amazon-redshift-spectrum

Cost control in Redshift Spectrum when scanning external tables (S3 data)


Athena has some default service limits that can help cap the cost of accidental "runaway" queries on a large data lake in S3. They are not ideal (they are based on time, not on the volume of data scanned), but they are still helpful.

What about Redshift Spectrum? What mechanisms does it provide that can be used to cap cost or mitigate the risk of accidentally scanning too much data in a single runaway query against S3? What's a good way of tackling this problem?


Solution

  • Amazon Redshift allows you to apply granular controls over Spectrum query execution using WLM Query Monitoring Rules (QMR).

    There are two Spectrum-specific metrics available: Spectrum scan size (the number of megabytes scanned by the query) and Spectrum scan row count (the number of rows scanned by the query).

    You can also use Query execution time to enforce a maximum duration, but this applies to all query types, not just Spectrum queries.

    Please note that these are sampled metrics. Queries are not aborted at precisely the point at which they exceed a rule; they are aborted at the next sample interval.

    If you have already been running Spectrum queries on your cluster, you can get started with QMR by using our wlm_qmr_rule_candidates script to generate candidate rules. The generated rules are based on the 99th percentile of each metric. A sample rule configuration and a query for auditing triggered rules are sketched below.
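    As a rough sketch of how such rules can be expressed, the WLM JSON configuration below defines two abort rules. The thresholds (10 GB scanned, 1 billion rows) and the single-queue layout are placeholder assumptions to adapt to your own workload; spectrum_scan_size_mb and spectrum_scan_row_count are the documented QMR metric names for the two Spectrum metrics above.

        [
          {
            "query_concurrency": 5,
            "rules": [
              {
                "rule_name": "spectrum_scan_mb_cap",
                "predicate": [
                  { "metric_name": "spectrum_scan_size_mb", "operator": ">", "value": 10240 }
                ],
                "action": "abort"
              },
              {
                "rule_name": "spectrum_scan_row_cap",
                "predicate": [
                  { "metric_name": "spectrum_scan_row_count", "operator": ">", "value": 1000000000 }
                ],
                "action": "abort"
              }
            ]
          }
        ]

    The caps are written as two separate rules deliberately: predicates within a single rule are ANDed together, so combining both into one rule would abort a query only when it exceeded both thresholds.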
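    Once rules are active, you can audit what they have triggered by querying the STL_WLM_RULE_ACTION system table, which records a row each time a rule's action fires. A minimal check:

        -- Most recent rule firings: which query tripped which rule, and what action was taken
        SELECT userid, query, service_class, rule, action, recordtime
        FROM stl_wlm_rule_action
        ORDER BY recordtime DESC
        LIMIT 20;

    Running new rules with the log action first, then switching to abort once the thresholds look right, is a low-risk way to tune them.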