I need to set-up Athena tables from Parquet files, where some columns have names not complying with Athena's SQL dialect, e.g.:
Is there a way to tweak the way the Parquet file fields are mapped into SQL column names in the CREATE EXTERNAL TABLE
statement without changing the underlying files, e.g. through SERDEPROPERTIES?
Athena can be configured to access fields by numerical index, thus completely ignoring the names they are given in the Parquet file. To do this, set the SerDes property 'parquet.column.index.access'='true'
. Example:
CREATE EXTERNAL TABLE `source_data_IDX`(
`0` string,
`1` bigint,
`2` double,
`3` bigint,
`4` string,
`5` string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'parquet.column.index.access'='true')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://test-data/students_grades_dataset'
TBLPROPERTIES (
'classification'='parquet',
'compressionType'='none',
'projection.enabled'='false',
'typeOfData'='file')
Then they can be queried, using column numbers, and optionally aliases which needn't conform to Athena's restrictions on column names, e.g:
SELECT "0" "NAME", "1" "age", "5" "name",
"2" "34gggggggggggggggHgggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggg"
FROM "test_database"."source_data_idx" limit 10;
This solution applies to both Parquet and ORC files, and is documented in the Athena user guide.