amazon-s3, amazon-redshift, etl, data-warehouse

How to transform data from S3 bucket before writing to Redshift DW?


I'm creating a (modern) data warehouse in Redshift. All of our infrastructure is hosted at Amazon. So far, I have set up DMS to ingest data (including changed data) from some tables of our business database (SQL Server on EC2, not RDS) and store it directly in S3.

Now I must transform and enrich this data in S3 before I can write it to Redshift. Our DW has tables for facts and dimensions (star schema). Imagine a Customer dimension: it should contain not only the customer's basic info, but also address, city, state, etc. This data is spread across a few tables in our business database.

So here's my problem: I don't have a clear idea of how to query the S3 staging area in order to join these tables and write the result to my Redshift DW. I want to do it using AWS services like Glue, Kinesis, etc., i.e., fully serverless.

Can Kinesis accomplish this task? Would it make things easier if I moved my staging area from S3 to Redshift, since all of our data is highly relational in nature anyway? If so, the question remains: how do I transform/enrich the data before saving it to our DW schemas? I have searched everywhere for this particular topic, but information on it is scarce.

Any help is appreciated.


Solution

  • There are a lot of ways to do this, but one idea is to query the data using Redshift Spectrum. Spectrum lets your Redshift cluster query data that stays in S3 (exposed as an external schema/database).

    At a really high level, one way to do this would be to create a Glue Crawler to crawl your S3 bucket; the crawler populates the Glue Data Catalog with table definitions, which becomes the external database that Redshift Spectrum can query.
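    As a minimal sketch (the crawler name, IAM role, Glue database, and S3 path below are all placeholder assumptions), the crawler could also be created and started with boto3 rather than through the console:

        import boto3

        glue = boto3.client("glue")

        # Crawl the S3 prefix that DMS writes to; the crawler registers the
        # resulting table definitions in a Glue Data Catalog database.
        glue.create_crawler(
            Name="dms-staging-crawler",                               # placeholder
            Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",  # placeholder
            DatabaseName="dms_staging",                               # placeholder
            Targets={"S3Targets": [{"Path": "s3://my-dms-bucket/business-db/"}]},
        )
        glue.start_crawler(Name="dms-staging-crawler")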

    This way, you don't need to move your data into Redshift itself. You will likely want to keep your "staging" area in S3 and only bring into Redshift the data that is ready to be used for reporting or analytics, which would be your Customer Dim table.

    Here is the documentation to do this: https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html
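    As a rough sketch of the Spectrum side, assuming the Glue database created by the crawler is called dms_staging and using placeholder table/column names (customer, address, city, and a local dw.customer_dim target) plus a placeholder IAM role, the SQL could be run from a small Python script with psycopg2:

        import psycopg2

        # One-time setup: expose the Glue Data Catalog database as an external
        # schema that Spectrum can query. Role ARN and names are placeholders.
        CREATE_EXTERNAL_SCHEMA = """
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_staging
        FROM DATA CATALOG
        DATABASE 'dms_staging'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';
        """

        # Join the staging tables sitting in S3 and load the result into the
        # Customer Dim table that lives inside Redshift.
        LOAD_CUSTOMER_DIM = """
        INSERT INTO dw.customer_dim (customer_id, customer_name, street, city, state)
        SELECT c.id, c.name, a.street, ci.name, ci.state
        FROM spectrum_staging.customer c
        JOIN spectrum_staging.address a ON a.customer_id = c.id
        JOIN spectrum_staging.city ci ON ci.id = a.city_id;
        """

        def run_etl():
            # Cluster endpoint and credentials are placeholders.
            conn = psycopg2.connect(
                host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
                port=5439,
                dbname="dw",
                user="etl_user",
                password="...",
            )
            try:
                with conn.cursor() as cur:
                    cur.execute(CREATE_EXTERNAL_SCHEMA)  # idempotent setup
                    cur.execute(LOAD_CUSTOMER_DIM)       # build/refresh the dim
                conn.commit()
            finally:
                conn.close()

        if __name__ == "__main__":
            run_etl()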

    To schedule the ETL SQL: I don't believe there is a scheduling tool built into Redshift, but you can do that in a few ways:

    1) Get an ETL tool, or set up cron jobs on a server or in Glue that schedule SQL scripts to be run. I do this with a Python script that connects to the database and then runs the SQL text. This would be a little bit more of a bulk operation. You can also do this in a Lambda function scheduled by a CloudWatch Events trigger on a cron schedule (see the Lambda sketch after these two options).

    2) Use a Lambda function that runs the SQL script you want, triggered by S3 PUTs into that bucket. That way the script runs right when the file drops, which is basically a real-time operation. Keep in mind that DMS drops files very quickly, so you may have files arriving multiple times per minute, which can be more difficult to handle.
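    For either option, a hedged sketch of the Lambda piece might look like the following. It uses the Amazon Redshift Data API via boto3 so no database driver has to be bundled with the function; the cluster identifier, database, Secrets Manager secret, and SQL are placeholder assumptions. The same handler can be wired to a CloudWatch Events/EventBridge cron rule (option 1) or to an S3 PUT notification on the DMS bucket (option 2):

        import boto3

        redshift_data = boto3.client("redshift-data")

        # Placeholder SQL: the same dimension-load statement as in the Spectrum sketch.
        LOAD_CUSTOMER_DIM = """
        INSERT INTO dw.customer_dim (customer_id, customer_name, street, city, state)
        SELECT c.id, c.name, a.street, ci.name, ci.state
        FROM spectrum_staging.customer c
        JOIN spectrum_staging.address a ON a.customer_id = c.id
        JOIN spectrum_staging.city ci ON ci.id = a.city_id;
        """

        def handler(event, context):
            # 'event' can come from an EventBridge cron rule or an S3 PUT
            # notification; this sketch ignores its contents either way.
            response = redshift_data.execute_statement(
                ClusterIdentifier="my-redshift-cluster",  # placeholder
                Database="dw",                            # placeholder
                SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-etl",  # placeholder
                Sql=LOAD_CUSTOMER_DIM,
            )
            # The Data API is asynchronous; the statement keeps running after we return.
            return {"statementId": response["Id"]}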