Search code examples
amazon-s3tableau-apiamazon-athenaamazon-redshift-spectrum

Redshift Spectrum much slower than Athena?


Our data is stored in S3 as JSON without partitions. Until today we were using only athena but now we tried Redshift Spectrum.

We are running the same query twice. Once using Redshift Spectrum and once using Athena. Both connect to the same data in S3.

Using Redshift Spectrum this report takes forever(more than 15 minutes) to run and using Athena it only takes 10 seconds to run.

The query that we are running in both cases in aws console is this:

SELECT "events"."persistentid" AS "persistentid",
  SUM(1) AS "sum_number_of_reco"
FROM "analytics"."events" "events"
GROUP BY "events"."persistentid"

Any idea what's going on? Thanks


Solution

  • The Redshift Spectrum processing power is limited by Redshift cluster size.

    You can find the infomation from Improving Amazon Redshift Spectrum Query Performance

    The Amazon Redshift query planner pushes predicates and aggregations to the Redshift Spectrum query layer whenever possible. When large amounts of data are returned from Amazon S3, the processing is limited by your cluster's resources. Redshift Spectrum scales automatically to process large requests. Thus, your overall performance improves whenever you can push processing to the Redshift Spectrum layer.

    On the other hand, Athena uses optimized amount of resource for the query, which may be larger than the Spectrum of a small Redshift cluster can get.

    This has been confirmed by our testing on Redshift Spectrum performance with different Redshift cluster size.