I have been researching Amazon's Redshift database as a possible future replacement for our data warehouse. My experience has always been with dimensional modeling and Ralph Kimball's methods, so it was a little weird to see that Redshift doesn't support features such as the serial data type for auto-incrementing columns.
There is, however, this recent blog post from the AWS Big Data blog about how to optimize Redshift for a star schema: https://blogs.aws.amazon.com/bigdata/post/Tx1WZP38ERPGK5K/Optimizing-for-Star-Schemas-and-Interleaved-Sorting-on-Amazon-Redshift
My question is: what is the best practice for loading a star schema in Redshift? I can't find this answered anywhere in Redshift's documentation.
I'm leaning toward importing my files from S3 into staging tables and then using SQL to do the transformations such as lookups and generating surrogate keys before inserting into the destination tables.
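For illustration, here is roughly the flow I have in mind; the bucket, role ARN, table, and column names are made up, and the MAX-plus-ROW_NUMBER trick is just one way I've seen suggested for generating surrogate keys without a serial column:

```sql
-- Stage the raw file from S3 into a staging table
COPY stage_orders
FROM 's3://my-bucket/orders/2015-07-01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV;

-- Do the dimension lookups and surrogate key generation in SQL,
-- then insert into the destination fact table
INSERT INTO fact_orders (order_key, customer_key, date_key, order_amount)
SELECT (SELECT COALESCE(MAX(order_key), 0) FROM fact_orders)
         + ROW_NUMBER() OVER (ORDER BY s.order_id) AS order_key,
       c.customer_key,
       d.date_key,
       s.order_amount
FROM   stage_orders s
JOIN   dim_customer c ON c.customer_natural_key = s.customer_id
JOIN   dim_date     d ON d.full_date = s.order_date;
```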
Is this what others are currently doing? Is there an ETL tool worth the money to make this easier?
This is the best practice I use:
In a separate process, if there are additional attributes for the dimension table coming from another source, I run an upsert: load the new data into a staging table, update the existing rows with a join, and then insert only the new rows.
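As a rough sketch of that upsert (the staging table, key, and column names here are just placeholders, and surrogate key generation is omitted for brevity):

```sql
BEGIN;

-- Update the attributes of rows that already exist in the dimension
UPDATE dim_customer
SET    email      = s.email,
       updated_at = GETDATE()
FROM   dim_customer_stage s
WHERE  dim_customer.customer_natural_key = s.customer_natural_key;

-- Insert only the rows that are not in the dimension yet
INSERT INTO dim_customer (customer_natural_key, email, updated_at)
SELECT s.customer_natural_key, s.email, GETDATE()
FROM   dim_customer_stage s
LEFT JOIN dim_customer d
       ON d.customer_natural_key = s.customer_natural_key
WHERE  d.customer_natural_key IS NULL;

COMMIT;
```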
On the performance side: as long as you do this work inside Redshift (or beforehand on EMR or something similar), it will perform much better than using an ETL tool that generates a lot of individual INSERT commands against Redshift.