Given a data-source of 1.4 TB of Parquet data on S3 partitioned by a timestamp field (so partitions are year
- month
- day
) I am querying a specific day of data (2.6 GB of data) and retrieving all available fields in the Parquet files via Redshift Spectrum with this query:
SELECT *
FROM my_external_schema.my_external_table
WHERE year = '2020' and month = '01' and day = '01'
The table is made available via a Glue Crawler that points at the top level "folder" in S3; this creates a Database and then via this command I link the Database to the new external schema:
create external schema my_external_schema from data catalog
database 'my_external_schema'
iam_role 'arn:aws:iam::123456789:role/my_role'
region 'my-region-9';
Analysing the table in my IDE I can see the table is generated by this statement:
create external table my_external_schema.my_external_table
(
id string,
my_value string,
my_nice_value string
)
partitioned by (year string, month string, day string)
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
with serdeproperties ('serialization.format'='1')
stored as
inputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
location 's3://my-bucket/my/location/'
table properties ('CrawlerSchemaDeserializerVersion'='1.0', 'CrawlerSchemaSerializerVersion'='1.0', 'UPDATED_BY_CRAWLER'='my_crawler');
When I analyse the query from Redshift I see it was scanned ~86 GB of data instead.
How's that possible? It is a concern because Redshift bills based on the amount of data scanned and looks like the service is scanning around 40 times the actual amount of data is in that partition.
I also tried to execute the same query in Athena and there I get only 2.55 GB of data scanned (definitely more reasonable).
I can't give too many details on the cluster size but assume that those 86GB of scanned data would fit in the cluster's Memory.
The problem seems to be in the AWS Redshift Console.
If we analyse the query from "query details" in Redshift console, I can see that the "Total data scanned" reports 86GB. As Vzarr mentioned, I run the same query on Athena to compare the performance. The execution time was basically the same but the amount of data scanned was completely different: 2.55GB.
I did the same comparison with other queries on S3 external schema, with and without using partitions columns: I saw that the total of GB scanned differs in every test, sometimes differs a lot (320MB in Redshift Spectrum, 20GB in Athena).
I decided to look at the system tables in Redshift in order to understand how the query on the external schema was working. I did a very simple test using SVL_S3QUERY:
SELECT (cast(s3_scanned_bytes as double precision) / 1024 / 1024 / 1024) as gb_scanned,
s3_scanned_rows,
query
FROM SVL_S3QUERY
WHERE query = '<my-query-id>'
The result was completely different from what AWS Redshift Console says for the same query. Not only the gb_scanned
was wrong, but s3_scanned_rows
was too. The query returns a total of 2.55GB of data Scanned, exactly the same of what Athena said.
To confirm the numbers in the SVL_S3QUERY I used AWS Cost Explorer to double check the total of gb scanned in a day with how much we paid for Redshift Spectrum: the numbers were basically the same.
At this point, I don't know from where or which table the AWS Redshift Console take the query details, but they seem to be completely wrong.