Tags: pandas, amazon-redshift, parquet

AWS Redshift Spectrum decimal type to read parquet double type


I have parquet files written by Pandas (pyarrow) with fields of Double type. The following is the parquet schema:

message schema {
  optional binary domain (STRING);
  optional binary type;
  optional binary country;
  optional binary region;
  optional binary country_code (STRING);
  optional int64 date (TIMESTAMP(MILLIS,true));
  optional double visits;
  optional double average_visit_duration;
}

Problem: I used Redshift Spectrum to create an external table to read the data in those parquet files. Table schema:

CREATE EXTERNAL TABLE spectrum.similarweb_daily_current(
  domain varchar(200), 
  type varchar(200), 
  country varchar(200), 
  region varchar(200), 
  country_code varchar(200), 
  visits decimal(38,37), 
  average_visit_duration decimal(38,37))
STORED as PARQUET
LOCATION
  's3://XXX'

When running a simple SELECT query, it throws a schema-incompatibility error: Double vs. Decimal. Does anyone have any idea how to solve it?
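The query was as simple as the following (an illustrative example, not the exact statement; column names are the ones from the table above):

SELECT domain, visits, average_visit_duration
FROM spectrum.similarweb_daily_current
LIMIT 10;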

Study I did:

  • On Pandas/pyarrow, it seems I can't adjust the schema to decimal when writing to parquet.
  • On Redshift, the Double type isn't supported for external tables (Spectrum).


Solution

  • Solved this: use the Redshift DOUBLE PRECISION type for the double columns in parquet (corrected DDL below).
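
A corrected version of the DDL from the question (same table name and S3 location as above; only the two numeric columns change):

CREATE EXTERNAL TABLE spectrum.similarweb_daily_current(
  domain varchar(200), 
  type varchar(200), 
  country varchar(200), 
  region varchar(200), 
  country_code varchar(200), 
  visits double precision,                   -- matches parquet double
  average_visit_duration double precision)   -- matches parquet double
STORED as PARQUET
LOCATION
  's3://XXX'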