amazon-web-services amazon-redshift data-warehouse amazon-athena

How to control increasing data volume in Redshift?

I have a data warehouse maintained in AWS Redshift. The data volume and velocity both have increased lately. One option is to keep scaling the cluster horizontally at the expanse of a higher cost of course. I was wondering if there are any archiving options available so that I can query the entire data as usual (maybe with a compromise in the querying time) but with a low or no additional cost?

One option would be to use external tables and query data directly from S3 but the tools used for achieving this, like Athena and Glue have their own cost, that too on a per query basis.

Solution

Easy options:

Ensure all tables have compression SELECT * FROM svv_table_info;.
Maximize compression by changing large tables to use ENCODE zstd.
Switch small tables < ~50k rows (depends) to DISTSTYLE ALL (yes this saves space!).
Switch from SSD based nodes (dc2) to HDD nodes (ds2) which have more 8x storage space.

Less easy options:

UNLOAD older data from Redshift to S3 and query using Redshift Spectrum.
Convert unloaded data to Parquet or ORC format using AWS Glue or AWS EMR and then query using Redshift Spectrum.

Please experiment with Redshift Spectrum. Query performance is typically very good and gets even better if your data is in a columnar format (Parquet/ORC).