
What is an optimal load path for Redshift?


in this blog https://blogs.aws.amazon.com/bigdata/post/Tx2ANLN1PGELDJU/Best-Practices-for-Micro-Batch-Loading-on-Amazon-Redshift

I fail to see what the main point of the load-stream approach is. Where does the saving come from? I understand turning off ANALYZE, sort, and VACUUM, using fixed compression, and all that. But why the separate tracks or empty temporary tables? You can do all of this without the multiple tracks. Is there some meaning in the phrase "an optimal load path for COPY to follow"? What exactly is optimal: the S3 prefix, or something else? Whatever is optimal, why would it not be optimal if you load directly into the final schema? Are we assuming that reading the S3 files is the bottleneck?

Could someone provide some insight?


Solution

  • If you are simply appending directly to final tables, then the tracks may be unnecessary. The tracks are used to avoid lock contention for load work (insert/update/delete) on tables that are actively being queried downstream.

    For instance, a transaction cannot complete while another process is using the table, e.g., a long-running data warehouse transformation is running while the table is being loaded. Using multiple tracks allows further loads to be staged and processed in parallel.
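A minimal sketch of what one "track" looks like in practice. The table names, bucket prefix, and IAM role ARN below are all hypothetical; the point is that each track COPYs into its own empty staging table, so the expensive S3 read happens outside any lock on the busy final table, and only a short merge transaction touches it:

```sql
-- Track 1: load the next micro-batch from S3 into this track's
-- empty staging table (track1_stage is a hypothetical name).
COPY track1_stage
FROM 's3://my-bucket/batch-0001/'        -- assumed bucket/prefix
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
GZIP
COMPUPDATE OFF                           -- fixed compression encodings, no analysis pass
STATUPDATE OFF;                          -- skip the automatic statistics update

-- Track 2 can run its own COPY into track2_stage at the same time;
-- neither COPY blocks queries or loads on final_table.

-- A single short transaction then merges the staged batch, so the
-- write lock on final_table is held only for the insert itself.
BEGIN;
INSERT INTO final_table SELECT * FROM track1_stage;
TRUNCATE track1_stage;
COMMIT;
```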

    You may also wish to use separate staging tables to take advantage of the ALTER TABLE APPEND functionality that was recently added. This lets you do a sorted load into a separate table and then instantly append those rows to the final table in a fully sorted state. If you load directly into the final table, the rows are marked as unsorted until the table is VACUUMed.
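    A sketch of the ALTER TABLE APPEND pattern, with hypothetical table names and S3 paths. The staging table must be a permanent table with the same columns, sort key, and distribution style as the target, which `CREATE TABLE ... (LIKE ...)` gives you:

    ```sql
    -- Staging table inherits columns, dist style, and sort key from sales.
    CREATE TABLE sales_stage (LIKE sales);

    -- Load the batch into the empty staging table. If the incoming files
    -- are ordered on the sort key, the rows are stored sorted.
    COPY sales_stage
    FROM 's3://my-bucket/sales/2017-01-15/'  -- assumed prefix
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole';

    -- Move the rows into the final table. Blocks are relocated rather
    -- than copied, so this is near-instant and needs no VACUUM afterwards.
    -- Note: ALTER TABLE APPEND cannot run inside a transaction block.
    ALTER TABLE sales APPEND FROM sales_stage;
    ```

    After the APPEND, `sales_stage` is empty and can be reused for the next batch.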