Search code examples
rubycopyamazon-redshiftpg

Identify Redshift COPY failing records in Ruby


When performing a COPY command, a few informations are printed, like :

INFO:  Load into table '<table>' completed, 22666 record(s) loaded successfully.
INFO:  Load into table '<table>' completed, 1 record(s) could not be loaded.  Check 'stl_load_errors' system table for details.

And I need to identify failing records.

Thus I need 2 things :

  • Determine when there are failing rows: now, it's only printed on screen and I don't know how to get the message in code.
  • Determine the failing rows.

One way to do that would be to access to the query identifier that is visible in the table stl_load_errors, but I have no clue how to access it by code.

(I currently use the pg gem to connect to redshift)


Solution

  • stl_load_errors is a table in Redshift that (as you may have guessed already) includes all the errors that happen when loading into Redshift. So you can query it by doing something like:

    SELECT * FROM stl_load_errors
    

    Now, to answer your questions use the following snippet:

    database = PG.connect(redshift)
    begin
      query = "COPY %s (%s) FROM 's3://%s/%s' CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s' CSV GZIP" %
    [ table, columns, s3_bucket, s3_key, access_key_id, secret_access_key ]
    
      database.exec(query)
      puts 'File succesfully imported'
    rescue PG::InternalError
      res = database.exec("SELECT line_number, colname, err_reason FROM pg_catalog.stl_load_errors WHERE filename = 's3://#{s3_bucket}/#{s3_key}'")
      res.each do |row|
        puts "Importing failed:\n> Line %s\n> Column: %s\n> Reason: %s" % row.values_at('line_number', 'colname', 'err_reason')
     end
    end
    

    That should output all the information you need, recall variables like redshift, table, columns, s3_bucket, s3_key, access_key_id, and secret_access_key depend on your configuration.

    UPDATE:

    To answer your comment below, more specifically, you could use a query like this:

    "SELECT lines_scanned FROM pg_catalog.stl_load_commits WHERE filename = 's3://#{s3_bucket}/#{s3_key}' AND errors = -1"