
How to read a binary-type column in a Parquet file with AWS Redshift Spectrum?


I have a Parquet file generated by ClickHouse. If I use pyarrow to show its schema:

import pyarrow.parquet as pq

data = pq.read_table('test.pqt')
print(data.schema)

it shows the schema is:

team_id: binary not null
project_id: uint32 not null
user_id: binary not null

I tried to create an external table in Redshift to read this data file:

CREATE EXTERNAL TABLE spectrum.my_test_table
(
    team_id varbyte(64),
    project_id bigint,
    user_id varbyte(64)
)
STORED AS PARQUET
LOCATION 's3://my_bucket/';

But Spectrum returned an error:

error: Spectrum Scan Error code: 15007 context: File 'https://s3.us-west-2.amazonaws.com/my_bucket/my_parquet.pqt' uses an unsupported compression: 7 for column 'team_id'

I also tried VARCHAR instead of VARBYTE, but the error was the same. How can I read these binary columns with Redshift Spectrum?

Update

I did some more testing on this problem and found that every column in a Parquet file generated by ClickHouse hits this error in Redshift:

xxx.parquet uses an unsupported compression: 7 for column xxx

This happens even for int and string columns. What is wrong with ClickHouse's Parquet output?
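
The compression codec is recorded per column chunk in the Parquet footer, so it can be checked directly with pyarrow before involving Redshift at all. A minimal sketch, assuming the same test.pqt file from above:

import pyarrow.parquet as pq

# Read the file footer and print the codec recorded for each column chunk.
meta = pq.ParquetFile('test.pqt').metadata
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        print(chunk.path_in_schema, chunk.compression)

Parquet's codec enum numbers LZ4_RAW as 7, so seeing LZ4 or LZ4_RAW printed here lines up with the "unsupported compression: 7" message from Spectrum.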


Solution

  • As @0x26res pointed out, LZ4 is ClickHouse's default Parquet compression, and Redshift Spectrum does not support it (the "compression: 7" in the error is the LZ4_RAW codec in the Parquet format spec). I changed the compression to gzip in ClickHouse and it works now. A re-encoding workaround is sketched below for cases where the export can't be changed.
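
If regenerating the file from ClickHouse is not convenient, one workaround is to re-encode it with pyarrow, which reads LZ4-compressed Parquet without trouble. A minimal sketch, assuming the question's test.pqt as input; the output name test_gzip.pqt is chosen here for illustration:

import pyarrow.parquet as pq

# Read the LZ4-compressed file and rewrite it with a codec Spectrum accepts.
table = pq.read_table('test.pqt')
pq.write_table(table, 'test_gzip.pqt', compression='gzip')  # 'snappy' is another Spectrum-supported choice

On the ClickHouse side, newer releases expose the Parquet codec through the output_format_parquet_compression_method setting, so exporting with it set to gzip (or snappy) should avoid the problem at the source; check your version's documentation for availability.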