When I run this query in the Athena query editor, it works as expected.
SELECT * FROM "sampledb"."elb_logs" limit 10;
The elb_logs table was generated by following the official tutorial. When I query the same table through Redshift Spectrum, every column comes back as NULL. I am creating athena_schema with these commands:
drop schema "athena_schema";
create external schema athena_schema from data catalog
database 'sampledb'
iam_role 'arn:aws:iam::94331XXXXXXX:role/RedshiftCopyUnload'
region 'ap-south-1';
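As a sanity check that the external schema exists and points at the sampledb database, the built-in svv_external_schemas view can be queried (the columns shown here are from the standard view):
select schemaname, databasename
from svv_external_schemas
where schemaname = 'athena_schema';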
And here is the output of the svv_external_tables system view:
select * from svv_external_tables;
schemaname: athena_schema
tablename: elb_logs
location: s3://athena-examples-ap-south-1/elb/plaintext
input_format: org.apache.hadoop.mapred.TextInputFormat
output_format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serialization_lib: org.apache.hadoop.hive.serde2.RegexSerDe
serde_parameters: {"input.regex":"([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*):([0-9]*) ([.0-9]*) ([.0-9]*) ([.0-9]*) (-|[0-9]*) (-|[0-9]*) ([-0-9]*)
compressed: 0
parameters: {"EXTERNAL":"TRUE","transient_lastDdlTime":"1480278335"}
Why does the Athena console show the correct values for all columns while Redshift Spectrum returns all NULLs?
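(One thing worth checking while debugging this kind of symptom is the svl_s3log system view, which records Redshift Spectrum scan messages; a query along these lines will show any errors reported by the external scan:)
select query, eventtime, message
from svl_s3log
order by eventtime desc
limit 20;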
This was because the elb_logs table uses regular-expression serialization (RegexSerDe), which Redshift Spectrum cannot handle. I converted the table to Parquet format with this Athena CTAS statement:
CREATE TABLE elb_logs3
WITH (
format = 'PARQUET',
parquet_compression = 'SNAPPY',
external_location = 's3://elb163/parqfiles'
) AS SELECT * FROM elb_logs;
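To double-check that the new table is registered as Parquet in the catalog, SHOW CREATE TABLE (standard Athena syntax) can be run in the Athena console and the SerDe and STORED AS lines inspected:
SHOW CREATE TABLE elb_logs3;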
Athena now has two tables, elb_logs and elb_logs3. After I recreate the external schema with the standard commands...
drop schema "athena_schema";
create external schema athena_schema from data catalog
database 'sampledb'
iam_role 'arn:aws:iam::XXX:role/RedshiftCopyUnload'
region 'us-east-1';
I can now select records from the elb_logs3 table like this:
select * from athena_schema.elb_logs3 limit 10;
Note that selecting from the original elb_logs table still returns NULL values for all columns.
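One way to see the difference from the Redshift side is to compare the serialization of the two tables in svv_external_tables; elb_logs still reports the RegexSerDe serialization_lib, while elb_logs3 should report a Parquet serde and input format (exact class names may vary):
select tablename, serialization_lib, input_format
from svv_external_tables
where schemaname = 'athena_schema';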