I am trying to load all tables from my AWS RDS (PostgreSQL) instance into Amazon Redshift.
Not that it matters much here, but I use Apache Airflow to run all the operations for me. The job boils down to truncating each Redshift table and reloading it from Parquet files in S3.
For many weeks this has worked just fine with a Redshift COPY command like this:
TRUNCATE {table};\n\
COPY {table}\n\
FROM '{s3_key}'\n\
IAM_ROLE '{os.getenv('REDSHIFT_IAM_ROLE')}'\n\
FORMAT AS PARQUET\n\
FILLRECORD\n\
;
However, I found that the DAG run failed this morning, and the logs look like this:
Running statement:
TRUNCATE users;
COPY users
FROM '<s3-bucket-name>'
IAM_ROLE '<iam-role>'
PARQUET
FILLRECORD
;
, parameters: None
psycopg2.errors.InternalError_: Assert
DETAIL:
-----------------------------------------------
error: Assert
code: 1000
context: Unreachable - Invalid type: 4000
query: 3514431
location: dory_util.cpp:528
process: padbmaster [pid=4694]
-----------------------------------------------
I tried to find the logs in Redshift by the query id from the error message above, running this command:
SELECT * FROM SVL_S3LOG WHERE query = '3514431';
But I cannot locate the details of the error anywhere.
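As far as I understand, STL_LOAD_ERRORS and STL_QUERY are the other standard system views where row-level load errors and the statement text end up, so a check against the same query id would look roughly like this:
-- Row-level load errors Redshift may have recorded for this COPY
SELECT query, filename, line_number, colname, type, err_reason
FROM stl_load_errors
WHERE query = 3514431;
-- The statement text and status for the same query id
SELECT query, starttime, endtime, aborted, TRIM(querytxt) AS querytxt
FROM stl_query
WHERE query = 3514431;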
I have searched around and asked ChatGPT, but I didn't find any similar issues or any direction for digging further into the error logs. I only found a few posts suggesting this may be some kind of Redshift internal error.
The Parquet format and the data type conversion had been totally fine up to now. Could anyone point out the problem or give me some suggestions to fix my data pipeline?
I had the same issue as you, and I solved it by listing the columns explicitly in the COPY command and adding the SERIALIZETOJSON option after the Parquet format.
So your COPY query should look something like this:
COPY users ("col1", "col2", ..., "coln")
FROM '<s3-bucket-name>'
IAM_ROLE '<iam-role>'
FORMAT AS PARQUET SERIALIZETOJSON;
Using this query I managed to load the data.
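For anyone hitting the same thing: as far as I know, SERIALIZETOJSON tells Redshift to load complex Parquet types (structs, arrays, maps) as JSON into SUPER columns, so the target table needs a SUPER column for any nested field. A rough sketch of such a table (the names below are placeholders, not from the question):
-- Placeholder table: "coln" stands for a nested struct/array column in the Parquet file,
-- so it is declared as SUPER and gets populated by COPY ... SERIALIZETOJSON.
CREATE TABLE users (
    col1 BIGINT,
    col2 VARCHAR(256),
    coln SUPER
);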