Search code examples
amazon-web-servicesamazon-redshiftamazon-redshift-spectrum

Skipping header rows in AWS Redshift External Tables


I have a file in S3 with the following data:

name,age,gender
jill,30,f
jack,32,m

And a redshift external table to query that data using spectrum:

create external table spectrum.customers ( 
 "name" varchar(50),
 "age" int,
 "gender" varchar(1))
row format delimited
fields terminated by ','
lines terminated by \n'
stored as textfile 
location 's3://...';

When querying the data I get the following result:

select * from spectrum.customers;
name,age,g
jill,30,f
jack,32,m

Is there an elegant way to skip the header row as part of the external table definition, similar to the tblproperties ("skip.header.line.count"="1") option in Hive? Or is my only option (at least for now) to filter out the header rows as part of the select statement?


Solution

  • Answered this in: How to skip headers when we are reading data from a csv file in s3 and creating a table in aws athena.

    This works in Redshift:

    You want to use table properties ('skip.header.line.count'='1') Along with other properties if you want, e.g. 'numRows'='100'. Here's a sample:

    create external table exreddb1.test_table
    (ID BIGINT 
    ,NAME VARCHAR
    )
    row format delimited
    fields terminated by ','
    stored as textfile
    location 's3://mybucket/myfolder/'
    table properties ('numRows'='100', 'skip.header.line.count'='1');