Import Parquet files with non-compliant field names into AWS Athena

I need to set-up Athena tables from Parquet files, where some columns have names not complying with Athena's SQL dialect, e.g.:

Longer than 128 characters
Multiple columns with names which only differ in case

Is there a way to tweak the way the Parquet file fields are mapped into SQL column names in the CREATE EXTERNAL TABLE statement without changing the underlying files, e.g. through SERDEPROPERTIES?

Solution

Athena can be configured to access fields by numerical index, thus completely ignoring the names they are given in the Parquet file. To do this, set the SerDes property 'parquet.column.index.access'='true'. Example:

CREATE EXTERNAL TABLE `source_data_IDX`(
  `0` string, 
  `1` bigint, 
  `2` double, 
  `3` bigint, 
  `4` string, 
  `5` string
)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ( 
  'parquet.column.index.access'='true') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://test-data/students_grades_dataset'
TBLPROPERTIES (
  'classification'='parquet', 
  'compressionType'='none', 
  'projection.enabled'='false', 
  'typeOfData'='file')

Then they can be queried, using column numbers, and optionally aliases which needn't conform to Athena's restrictions on column names, e.g:

SELECT "0" "NAME", "1" "age", "5" "name",
"2" "34gggggggggggggggHgggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggg"
FROM "test_database"."source_data_idx" limit 10;

This solution applies to both Parquet and ORC files, and is documented in the Athena user guide.