amazon-web-services, amazon-redshift

Will locking a table before upserting on Redshift help in any way?


I came across this code while reviewing some repos. I know what LOCK does, and I also know that Redshift maintains a consistent version of a table between queries using snapshots. So the question is: does that lock at the beginning of the transaction make any sense? I think it causes more problems by blocking users than it solves. Thanks!

    begin transaction;
    lock ${rsDestination};

    create temp table stage (like ${rsDestination});

    insert into stage
    (${_.join(_.keys(schema), ', ')})
    values ${formattedRecords.join(',\n')};

    delete from ${rsDestination}
    using stage
    where ${_.join(_.map(primaryKeys, (key) => `${rsDestination}.${key} = stage.${key}`), ' AND ')};

    insert into ${rsDestination}
    select * from stage;
    end transaction;
    drop table stage;

There are no apparent problems with commenting out that line; I want to know if I'm missing something. Thanks!


Solution

  • I would take this code with a grain of salt - there is an issue here beyond the lock. See below.

    As you can see, the lock is not needed for this code. It likely got here by copying code from somewhere else, where it may have been needed. Locking like this is done to stop dependency loops in ETL code. It is a hack, and the best way to prevent these loops is to design your ETL code without concurrent loops in the first place. (A sketch of the kind of situation where the lock would matter follows.)
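
    As a hedged illustration - the table names orders and order_totals are made up, not from the question - this is the shape of ETL code where an up-front LOCK does matter: two concurrent transactions that each write the same pair of permanent tables, in opposite order, can deadlock unless both take explicit locks first.

        -- ETL job 1
        begin;
        lock orders, order_totals;          -- take both locks up front
        insert into orders       select * from stage_orders;
        insert into order_totals select * from stage_totals;
        end;

        -- ETL job 2, running concurrently, writes the same tables in the opposite order
        begin;
        lock orders, order_totals;          -- same tables, locked in the same order
        insert into order_totals select * from stage_totals_b;
        insert into orders       select * from stage_orders_b;
        end;

        -- Without the LOCK lines, job 1 can hold a write lock on orders while
        -- waiting for order_totals, and job 2 the reverse - a deadlock.
        -- With them, whichever job locks first runs and the other simply waits.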

    In this code the source is a VALUES() statement, which cannot be part of a dependency loop, and there is only one permanent table involved, so no loop is possible. The lock does no good here and may just slow down other queries on the cluster.

    Now to the other issue this code has: using VALUES() to insert data. Doing this rarely, for ten or so rows, is fine - it's small and infrequent. Doing it frequently, at any data size, or with large blocks of data is a cluster killer: it loads the leader node by pushing all the data through the query compiler, and it runs into the statement character limit (64K), so it is a source of errors. Just don't do this - use a COPY from S3 instead (a rough sketch follows).
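
    For completeness, here is a hedged sketch of the same upsert with the staged rows loaded via COPY rather than VALUES(); the destination table, key column, S3 path, and IAM role are placeholders, not taken from the original code.

        begin transaction;

        create temp table stage (like my_destination);

        -- Bulk-load the batch from S3; the work happens on the compute nodes
        -- instead of pushing every row through the leader node's compiler.
        copy stage
        from 's3://my-etl-bucket/upserts/batch_0001/'
        iam_role 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
        format as csv
        gzip;

        -- Same delete-then-insert upsert as before; still no LOCK needed.
        delete from my_destination
        using stage
        where my_destination.id = stage.id;

        insert into my_destination
        select * from stage;

        end transaction;
        drop table stage;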