I'm trying to load specific columns into Redshift Spectrum, but it appears that the data is being loaded positionally, which means it doesn't matter what I put for the column names.
To make this more concrete:
Let's say the data I want to load is stored positionally in Parquet as columns A, B, C. I only want column B, so I write:
create external table spectrum.Foo(
B varchar(500)
)
STORED AS PARQUET
LOCATION 's3://data/';
Unfortunately, when I do that, it actually loads the data of column A into Foo.B.
Is there some extra syntax I need? I scoured the documentation but couldn't find anything.
Thanks!
Defining the external table does not load data into Redshift -- it simply enables you to query it in place. If you want to ingest only a subset of the columns, you can try the following queries:
create external table spectrum.Foo(
A varchar(100), B varchar(100), C varchar(100)
) stored as parquet location 's3://data/';
This defines the external table. Then:
create table local_data as select A, C from spectrum.Foo;
This will load only the specified columns from the external table into a local table in Redshift. However, you don't usually need to do that: the advantage of Redshift Spectrum with Parquet is that when you run the query:
select A, C from spectrum.Foo;
it will read from S3 only the data for the columns you reference, reducing the number of I/O operations against S3 and therefore the cost.
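If you want to sanity-check this yourself, one option (assuming you have access to the Spectrum system views, and that the column names below are as I remember them) is to compare how many bytes each query scanned in S3 using SVL_S3QUERY_SUMMARY:

-- bytes scanned in S3 by recent Spectrum queries (newest first)
select query, starttime, s3_scanned_rows, s3_scanned_bytes
from svl_s3query_summary
order by starttime desc
limit 10;

A select that touches only a couple of Parquet columns should show a noticeably smaller s3_scanned_bytes than a select * over the same files.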