I have a Parquet file generated by ClickHouse. If I use pyarrow to show its schema:
import pyarrow.parquet as pq
data = pq.read_table('test.pqt')  # read the whole file into an Arrow table
print(data.schema)
it shows the schema like this:
team_id: binary not null
project_id: uint32 not null
user_id: binary not null
Then I tried to create an external table in Redshift to read this data file:
CREATE EXTERNAL TABLE spectrum.my_test_table
(
team_id varbyte(64),
project_id bigint,
user_id varbyte(64)
)
STORED AS PARQUET
LOCATION 's3://my_bucket/';
But Spectrum returned an exception saying:
error: Spectrum Scan Error code: 15007 context: File 'https://s3.us-west-2.amazonaws.com/my_bucket/my_parquet.pqt' uses an unsupported compression: 7 for column 'team_id'
I also tried using VARCHAR instead of VARBYTE, but the error was the same. How can I read these binary columns using Redshift Spectrum?
====Update====
I did some more testing on this problem, and I found that every column in a Parquet file generated by ClickHouse triggers this error in Redshift:
xxx.parquet uses an unsupported compression: 7 for column xxx
Even the int and string columns fail, so what's wrong with ClickHouse's Parquet?
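If anyone wants to check what their file actually contains: codec 7 in the Parquet format's CompressionCodec enum is LZ4_RAW, and you can print the codec recorded for each column chunk with pyarrow (using my test.pqt file as the example):

import pyarrow.parquet as pq

# Inspect the codec stored in the Parquet metadata; Spectrum's
# "unsupported compression: 7" corresponds to LZ4_RAW in the spec.
meta = pq.ParquetFile('test.pqt').metadata
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        print(chunk.path_in_schema, chunk.compression)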
As @0x26res said, lz4 is ClickHouse's default Parquet compression, and Redshift does not support it. So I changed the compression to gzip in ClickHouse and it works now.
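In case it helps others, this is roughly how the export looks (assuming a ClickHouse version recent enough to have the output_format_parquet_compression_method setting; my_table is a placeholder for your table):

-- switch the Parquet codec for this session from the lz4 default to gzip
SET output_format_parquet_compression_method = 'gzip';

SELECT *
FROM my_table
INTO OUTFILE 'test.pqt'
FORMAT Parquet;

After re-exporting with gzip, Spectrum scans the file without the 15007 error.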