Tags: amazon-s3, amazon-redshift, amazon-athena, amazon-redshift-spectrum

Redshift Spectrum over 40x slower than Athena for simple queries


I have an S3 data lake that I can query with Athena. The same data lake is hooked up to Amazon Redshift as well. However, when I run queries in Redshift I get dramatically longer query times than in Athena, even for the simplest queries.

Query in Athena

CREATE TABLE x as (select p.anonymous_id, p.context_traits_email, p."_timestamp", p.user_id FROM foo.pages p)
  • Run time: 24.432 sec
  • Data scanned: 111.47 MB

Query in Redshift

create temp table x as (
select p.anonymous_id, p.context_traits_email, p."_timestamp", p.user_id
FROM external_schema.pages p
)

The query takes 17 minutes to run!

Datalake & Glue

The data lake has a Glue catalog attached that is maintained by a third-party tool (RudderStack). There are no crawlers; RudderStack places Parquet files in specific prefixes of the S3 bucket and updates the Glue catalog when there are schema changes, etc.

Here are the relevant parts of the Glue Table definition:

  • Location: s3://{bucketname}/rudder-datalake/{schemaname}/pages
  • Input format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
  • Output format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
  • Serde serialization lib: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
  • Partitions: none
  • Indexes: none

Redshift External Schema

The external schema in Redshift was created like this:

create external schema if not exists external_schema
from data catalog
database 'foo'
region 'us-east-1'
iam_role 'arn:aws:iam::xxxxx';

The CPU utilization on the Redshift cluster while the query is running (a single dc2.large node) never goes over 15% during the entire runtime. So if Redshift isn't overloaded, why is there such a huge difference in query processing time, especially when the query has no joins, distincts, etc.?

EDIT

Redshift's SVL_S3QUERY_SUMMARY lists only one entry for this query:

  • external_table_name: S3 Scan {redshiftdb}_external_schema_pages
  • files: 144,067
  • splits: 144,067
  • avg_request_parallelism: 10
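
For reference, numbers like these can be pulled from the system view with a query along the following lines (the filter on external_table_name is purely illustrative; SVL_S3QUERY_SUMMARY also exposes per-query timing and scan-size columns):

select query, external_table_name, files, splits, avg_request_parallelism, elapsed
from svl_s3query_summary
where external_table_name like '%pages%'
order by query desc
limit 10;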

Solution

  • You have far too many files.

    The overhead of S3 access is the problem.

    Remember, you get at most ten Spectrum workers per slice (in practice often fewer), and you probably have a small cluster. You might be getting only around ten files read concurrently, with 144,000 files to work through, and each S3 connection incurring an irreducible delay.
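As a sketch of one fix, the data could be compacted into far fewer, larger Parquet files with an Athena CTAS into a new location. The target table name, location suffix, and bucket count below are illustrative assumptions (user_id is taken from the queries above); bucketing caps the number of output files, so 144,000 small files collapse into roughly bucket_count larger ones:

CREATE TABLE foo.pages_compacted
WITH (
  format = 'PARQUET',
  external_location = 's3://{bucketname}/rudder-datalake/{schemaname}/pages_compacted/',
  bucketed_by = ARRAY['user_id'],
  bucket_count = 16
) AS
SELECT * FROM foo.pages;

Pointing the Redshift external schema (or a new Glue table) at the compacted location should let Spectrum read a handful of large files instead of paying the per-file S3 request overhead 144,000 times.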