amazon-redshift, parquet, amazon-redshift-spectrum

How do you load specific columns formatted in Parquet into Redshift Spectrum?


I'm trying to load specific columns into Redshift Spectrum, but it appears that the data is being loaded positionally, which means it doesn't matter what I put for the column names.

To make this more concrete:

Let's say the data I want to load is stored in Parquet as columns A, B, C (in that order). I only want column B, so I write:

create external table spectrum.Foo(
  B varchar(500)
)
STORED AS PARQUET
LOCATION 's3://data/';

Unfortunately, when I do that, it actually loads column A's data into Foo.B.

Is there some extra syntax I need? I scoured the documentation but couldn't find anything.

Thanks!


Solution

  • Defining the external table does not load any data into Redshift; it simply lets you query the data in place on S3. If you want to ingest only a subset of the columns, you can try the following set of queries:

    create external table spectrum.Foo(
      A varchar(100), B varchar(100), C varchar(100)
    ) stored as parquet location 's3://data/';
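
    Note that the spectrum schema referenced above must already exist as an external schema. A minimal sketch of creating it, assuming a hypothetical Glue Data Catalog database named 'spectrumdb' and a placeholder IAM role ARN:

    -- Create the external schema backed by the AWS Glue Data Catalog.
    -- The database name and IAM role are placeholders; substitute your own.
    create external schema spectrum
    from data catalog
    database 'spectrumdb'
    iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole'
    create external database if not exists;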
    

    With the external table defined over all of the columns, you can then copy just the columns you need into a local table:

    create table local_data as select A, C from spectrum.Foo;
    

    This loads only the specified columns from the external table into a local table in Redshift. However, you often don't need to do that: the advantage of Redshift Spectrum with Parquet is that when you run the query:

    select A, C from spectrum.Foo;
    

    it will read only the data for the accessed columns from S3, reducing both the number of I/O operations against S3 and the cost of the query.
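
    One way to sanity-check this is to look at Redshift's Spectrum query statistics; a minimal sketch, assuming you run it in the same session right after the Spectrum query (so pg_last_query_id() still points at it):

    -- Inspect how much data the last Spectrum query actually scanned from S3.
    -- If only A and C are selected, s3_scanned_bytes should be noticeably
    -- smaller than the total size of the Parquet files, since column B's
    -- chunks are skipped.
    select query, s3_scanned_rows, s3_scanned_bytes
    from svl_s3query_summary
    where query = pg_last_query_id();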