
Simultaneously `CREATE TABLE LIKE` in AWS Redshift and change a few columns' default values


Workflow

  1. In a data import workflow, we create a staging table using a CREATE TABLE LIKE statement:

CREATE TABLE abc_staging (LIKE abc INCLUDING DEFAULTS);

  2. Then, we run COPY to import CSV data from S3 into the staging table (a sketch of this step follows the list).

  3. The data in the CSV is incomplete. Namely, the fields partition_0, partition_1, partition_2 are missing from the CSV file; we fill them in like this:

UPDATE
  abc_staging
SET
  partition_0 = 'BUZINGA',
  partition_1 = '2018',
  partition_2 = '07';
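
For reference, the COPY in step 2 looks roughly like this; a minimal sketch where the bucket path, IAM role, and the column names present in the CSV (col_a, col_b) are placeholders:

COPY abc_staging (col_a, col_b)
FROM 's3://my-bucket/path/to/files/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
FORMAT AS CSV;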

Problem

This query is expensive (it often takes ≈20 minutes), and I would like to avoid it. That would be possible if I could configure DEFAULT values on these columns when creating the abc_staging table. I have not found any method for doing that, nor any explicit indication that it is impossible. So perhaps it is still possible and I am missing how to do it?

Alternative solutions I considered

Drop these columns and add them again

That would be easy to do, but ALTER TABLE ADD COLUMN only adds columns to the end of the column list. In the abc table they are not at the end of the column list, which means the schemas of abc and abc_staging would no longer match. That breaks the ALTER TABLE APPEND operation that I use to move data from the staging table to the main table.

Note. Reordering the columns in the abc table to alleviate this difficulty would require recreating the huge abc table, which I'd like to avoid.
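
For concreteness, a sketch of what this alternative would look like (the column types are my assumptions):

ALTER TABLE abc_staging DROP COLUMN partition_0;
ALTER TABLE abc_staging DROP COLUMN partition_1;
ALTER TABLE abc_staging DROP COLUMN partition_2;
-- Redshift appends new columns at the end of the column list,
-- so abc_staging no longer lines up with abc for ALTER TABLE APPEND
ALTER TABLE abc_staging ADD COLUMN partition_0 VARCHAR(32) DEFAULT 'BUZINGA';
ALTER TABLE abc_staging ADD COLUMN partition_1 VARCHAR(4) DEFAULT '2018';
ALTER TABLE abc_staging ADD COLUMN partition_2 VARCHAR(2) DEFAULT '07';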

Generate the staging table creation script programmatically with proper columns and get rid of CREATE TABLE LIKE

I will have to do that if I do not find any better solution.
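
If I end up there, the generated script would look something along these lines (col_a and col_b stand in for the real columns; the partition values would be templated per import). With the DEFAULTs in place, a COPY that lists only the columns present in the CSV should fill the omitted partition_* columns from their defaults, so the UPDATE goes away:

CREATE TABLE abc_staging (
  col_a VARCHAR(64),
  col_b BIGINT,
  partition_0 VARCHAR(32) DEFAULT 'BUZINGA',
  partition_1 VARCHAR(4) DEFAULT '2018',
  partition_2 VARCHAR(2) DEFAULT '07'
);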

Fill in the partition_* fields in the original CSV file

That is possible but would break backwards compatibility (there are perhaps hundreds of thousands of existing files). Harder, but manageable.


Solution

  • As you are finding, you are not creating a table exactly LIKE the original, and Redshift doesn't let you ALTER a column's default value. Your proposed path is likely the best: define the staging table explicitly.

    Since I don't know your exact situation, other paths might be better, so let me explore a bit. First off, when you UPDATE the staging table you are in fact reading every row in the table, invalidating that row, and writing a new row (with the new information) at the end of the table. This leaves a lot of invalidated rows behind. Then, when you do ALTER TABLE APPEND, all these invalidated rows are carried into your main table, unless you vacuum the staging table beforehand. So you may not be getting the value you want out of ALTER TABLE APPEND.
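
    If you keep the UPDATE, a sketch of the vacuum-then-append order (table names as in the question):

    VACUUM DELETE ONLY abc_staging;  -- reclaim the rows invalidated by the UPDATE
    ALTER TABLE abc APPEND FROM abc_staging;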

    You may be better off INSERTing the data into your main table with an ORDER BY clause. This is slower than the ALTER TABLE APPEND statement, but you won't have to do the UPDATE, so the overall process could be faster. You could come out further ahead because of the reduced need to VACUUM. Your situation will determine whether this is better or not. Just another option for your list.
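
    Along these lines, the INSERT can supply the partition values as constants, which removes the UPDATE entirely; a sketch, with col_a and col_b as placeholder column names and col_a standing in for abc's sort key:

    INSERT INTO abc (col_a, col_b, partition_0, partition_1, partition_2)
    SELECT col_a, col_b, 'BUZINGA', '2018', '07'
    FROM abc_staging
    ORDER BY col_a;  -- order by the main table's sort key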

    I am curious about your UPDATE speed. It just needs to read and then write every row in the staging table; unless the staging table is very large, it doesn't seem like this should take 20 minutes. Other activity could be creating this slowdown. Just curious.

    Another option would be to change your main table to have these 3 columns last (yes, this would be some work). This way you could add the columns to the end of the staging table and things would line up for ALTER TABLE APPEND. Just another possibility.
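
    If you do go that way, the usual pattern is a one-time deep copy and rename; a sketch with placeholder column names/types (carry over any DISTKEY/SORTKEY settings from abc as needed):

    CREATE TABLE abc_new (
      col_a VARCHAR(64),
      col_b BIGINT,
      partition_0 VARCHAR(32),
      partition_1 VARCHAR(4),
      partition_2 VARCHAR(2)
    );
    INSERT INTO abc_new
    SELECT col_a, col_b, partition_0, partition_1, partition_2 FROM abc;
    DROP TABLE abc;
    ALTER TABLE abc_new RENAME TO abc;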