Tags: python, postgresql, amazon-web-services, amazon-redshift, parquet

Copy Parquet from S3 to Redshift fails: Unreachable - Invalid type: 4000


I am currently trying to load all the tables from my AWS RDS (PostgreSQL) instance into Amazon Redshift.

Though it is not central to the issue, I use Apache Airflow to orchestrate all of the operations. The job works like this:

  1. Export all the tables from RDS, convert them to Parquet files, and upload them to S3 (a rough sketch of this step follows the list)
  2. Extract the tables' schemas from the pandas DataFrames as part of writing the Apache Parquet files
  3. Load the Parquet files from S3 into Redshift
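
For steps 1 and 2, the export looks roughly like the sketch below. This is only a minimal illustration assuming pandas, pyarrow, SQLAlchemy, and boto3; the export_table_to_s3 function, the RDS_CONNECTION_URI environment variable, and the bucket/prefix arguments are placeholders rather than my actual DAG code.

import os

import boto3
import pandas as pd
from sqlalchemy import create_engine

def export_table_to_s3(table: str, bucket: str, prefix: str) -> str:
    """Dump one RDS table to a local Parquet file and upload it to S3."""
    engine = create_engine(os.environ["RDS_CONNECTION_URI"])  # placeholder connection string

    # pandas infers the dtypes; pyarrow maps them onto the Parquet schema
    df = pd.read_sql(f"SELECT * FROM {table}", engine)

    local_path = f"/tmp/{table}.parquet"
    df.to_parquet(local_path, engine="pyarrow", index=False)

    s3_key = f"{prefix}/{table}.parquet"
    boto3.client("s3").upload_file(local_path, bucket, s3_key)
    return f"s3://{bucket}/{s3_key}"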

For many weeks this worked just fine with a Redshift COPY command like this:

TRUNCATE {table};\n\
COPY {table}\n\
FROM '{s3_key}'\n\
IAM_ROLE '{os.getenv('REDSHIFT_IAM_ROLE')}'\n\
FORMAT AS PARQUET\n\
FILLRECORD\n\
;
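
Roughly, that statement gets rendered and executed along these lines (a simplified sketch rather than my real task code; the REDSHIFT_DSN environment variable and the run_copy helper are placeholders):

import os

import psycopg2

def run_copy(table: str, s3_key: str) -> None:
    """Truncate the target table and COPY the Parquet file from S3 into Redshift."""
    iam_role = os.getenv("REDSHIFT_IAM_ROLE")
    sql = f"""
        TRUNCATE {table};
        COPY {table}
        FROM '{s3_key}'
        IAM_ROLE '{iam_role}'
        FORMAT AS PARQUET
        FILLRECORD
        ;
    """
    with psycopg2.connect(os.environ["REDSHIFT_DSN"]) as conn:  # placeholder DSN
        conn.autocommit = True  # apply TRUNCATE and COPY immediately
        with conn.cursor() as cur:
            cur.execute(sql)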

However, I found that the DAG run failed this morning, and the logs look like this:

Running statement: 
                            TRUNCATE users;
                            COPY users
                            FROM '<s3-bucket-name>'
                            IAM_ROLE '<iam-role>'
                            PARQUET
                            FILLRECORD
                            ;
                        , parameters: None


psycopg2.errors.InternalError_: Assert
DETAIL:  
  -----------------------------------------------
  error:  Assert
  code:      1000
  context:   Unreachable - Invalid type: 4000
  query:     3514431
  location:  dory_util.cpp:528
  process:   padbmaster [pid=4694]
  -----------------------------------------------

I tried to find the logs in Redshift using the query ID from the error message above, by running:

SELECT * FROM SVL_S3LOG WHERE query = '3514431';

But I cannot locate the details of the error anywhere.
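
For anyone retracing my digging: STL_LOAD_ERRORS is another standard Redshift system view worth checking alongside SVL_S3LOG (it records per-row load errors, though it may well be empty here since this is an internal assert). A small helper like the sketch below pulls both views for a given query ID; the fetch_error_rows name and the REDSHIFT_DSN variable are placeholders.

import os

import pandas as pd
import psycopg2

def fetch_error_rows(query_id: int) -> dict:
    """Return any rows for the given query ID from a couple of Redshift system views."""
    results = {}
    with psycopg2.connect(os.environ["REDSHIFT_DSN"]) as conn:  # placeholder DSN
        for view in ("SVL_S3LOG", "STL_LOAD_ERRORS"):
            results[view] = pd.read_sql(
                f"SELECT * FROM {view} WHERE query = %s", conn, params=[query_id]
            )
    return results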

I have searched around and asked ChatGPT, but I couldn't find any similar issues or even directions for digging further into the error logs. I only found a few reports suggesting that this may be some kind of Redshift internal error.

The Parquet format and data type conversion themselves were fine, though. Could anyone please point me in the right direction or suggest how to fix this data pipeline issue?


Solution

  • I had the same issue as you, and I solved it by:

    • specifying columns
    • adding SERIALIZETOJSON

    So, your copy query should look something like this:

    COPY users ("col1", "col2", ..., "coln")
    FROM '<s3-bucket-name>'
    IAM_ROLE '<iam-role>'
    FORMAT AS PARQUET SERIALIZETOJSON;
    

    Using this query I managed to load the data.
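
    If you want to fold this back into your Airflow task, the change is only to the SQL string. Below is a sketch using the same table and s3_key variables from your question; the build_copy_sql helper and the columns argument are placeholders for however you track your column lists. Note that AWS documents SERIALIZETOJSON as the option for loading complex Parquet types (arrays/structs) into SUPER columns, so any nested columns on the Redshift side may need to be of type SUPER.

    import os

    def build_copy_sql(table: str, s3_key: str, columns: list) -> str:
        """Build the TRUNCATE + COPY statement with an explicit column list and SERIALIZETOJSON."""
        iam_role = os.getenv("REDSHIFT_IAM_ROLE")
        column_list = ", ".join(f'"{col}"' for col in columns)
        return f"""
            TRUNCATE {table};
            COPY {table} ({column_list})
            FROM '{s3_key}'
            IAM_ROLE '{iam_role}'
            FORMAT AS PARQUET SERIALIZETOJSON
            ;
        """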