Identify Redshift COPY failing records in Ruby

When performing a COPY command, a few informations are printed, like :

INFO:  Load into table '<table>' completed, 22666 record(s) loaded successfully.
INFO:  Load into table '<table>' completed, 1 record(s) could not be loaded.  Check 'stl_load_errors' system table for details.

And I need to identify failing records.

Thus I need 2 things :

Determine when there are failing rows: now, it's only printed on screen and I don't know how to get the message in code.
Determine the failing rows.

One way to do that would be to access to the query identifier that is visible in the table stl_load_errors, but I have no clue how to access it by code.

(I currently use the pg gem to connect to redshift)

Solution

stl_load_errors is a table in Redshift that (as you may have guessed already) includes all the errors that happen when loading into Redshift. So you can query it by doing something like:

SELECT * FROM stl_load_errors

Now, to answer your questions use the following snippet:

database = PG.connect(redshift)
begin
  query = "COPY %s (%s) FROM 's3://%s/%s' CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s' CSV GZIP" %
[ table, columns, s3_bucket, s3_key, access_key_id, secret_access_key ]

  database.exec(query)
  puts 'File succesfully imported'
rescue PG::InternalError
  res = database.exec("SELECT line_number, colname, err_reason FROM pg_catalog.stl_load_errors WHERE filename = 's3://#{s3_bucket}/#{s3_key}'")
  res.each do |row|
    puts "Importing failed:\n> Line %s\n> Column: %s\n> Reason: %s" % row.values_at('line_number', 'colname', 'err_reason')
 end
end

That should output all the information you need, recall variables like redshift, table, columns, s3_bucket, s3_key, access_key_id, and secret_access_key depend on your configuration.

UPDATE:

To answer your comment below, more specifically, you could use a query like this:

"SELECT lines_scanned FROM pg_catalog.stl_load_commits WHERE filename = 's3://#{s3_bucket}/#{s3_key}' AND errors = -1"