I have been researching Amazon's Redshift database as a possible future replacement for our data warehouse. My experience has always been with dimensional modeling and Ralph Kimball's methods, so it was a little weird to see that Redshift doesn't support features such as the serial data type for auto-incrementing columns.
There is, however, this recent blog post from the AWS Big Data blog about how to optimize Redshift for a star schema: https://blogs.aws.amazon.com/bigdata/post/Tx1WZP38ERPGK5K/Optimizing-for-Star-Schemas-and-Interleaved-Sorting-on-Amazon-Redshift
My question is: what is the best practice for loading a star schema in Redshift? I can't find this answered anywhere in Redshift's documentation.
I'm leaning toward importing my files from S3 into staging tables and then using SQL to do the transformations such as lookups and generating surrogate keys before inserting into the destination tables.
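For illustration, here is roughly the flow I have in mind; the bucket, role ARN, table, and column names are made up, and the MAX-plus-ROW_NUMBER trick is just one way I've seen suggested for generating surrogate keys without a serial column:

```sql
-- Stage the raw file from S3 into a staging table
COPY stage_orders
FROM 's3://my-bucket/orders/2015-07-01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV;

-- Do the dimension lookups and surrogate key generation in SQL,
-- then insert into the destination fact table
INSERT INTO fact_orders (order_key, customer_key, date_key, order_amount)
SELECT (SELECT COALESCE(MAX(order_key), 0) FROM fact_orders)
         + ROW_NUMBER() OVER (ORDER BY s.order_id) AS order_key,
       c.customer_key,
       d.date_key,
       s.order_amount
FROM   stage_orders s
JOIN   dim_customer c ON c.customer_natural_key = s.customer_id
JOIN   dim_date     d ON d.full_date = s.order_date;
```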
Is this what others are currently doing? Is there an ETL tool worth the money to make this easier?
This is the best practice I use:
In a separate process, if there are additional attributes for the dimension table coming from another source, I run an upsert: load the new data into a staging table, update the existing rows with a join, and then insert only the new rows.
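As a rough sketch of that upsert (the staging table, key, and column names here are just placeholders, and surrogate key generation is omitted for brevity):

```sql
BEGIN;

-- Update the attributes of rows that already exist in the dimension
UPDATE dim_customer
SET    email      = s.email,
       updated_at = GETDATE()
FROM   dim_customer_stage s
WHERE  dim_customer.customer_natural_key = s.customer_natural_key;

-- Insert only the rows that are not in the dimension yet
INSERT INTO dim_customer (customer_natural_key, email, updated_at)
SELECT s.customer_natural_key, s.email, GETDATE()
FROM   dim_customer_stage s
LEFT JOIN dim_customer d
       ON d.customer_natural_key = s.customer_natural_key
WHERE  d.customer_natural_key IS NULL;

COMMIT;
```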
On the performance side: as long as you do this work inside Redshift (or beforehand on EMR or something similar), it will perform much better than using an ETL tool that generates a lot of individual INSERT commands against Redshift.