Tags: sql, hive, parquet, amazon-athena, presto

Import Parquet files with non-compliant field names into AWS Athena


I need to set up Athena tables from Parquet files where some columns have names that do not comply with Athena's SQL dialect, e.g.:

  • Longer than 128 characters
  • Multiple columns with names that differ only in case

Is there a way to tweak how the Parquet file fields are mapped to SQL column names in the CREATE EXTERNAL TABLE statement, without changing the underlying files, e.g. through SERDEPROPERTIES?


Solution

  • Athena can be configured to access fields by numerical index, completely ignoring the names given to them in the Parquet file. To do this, set the SerDe property 'parquet.column.index.access'='true'. Example:

    CREATE EXTERNAL TABLE `source_data_IDX`(
      `0` string, 
      `1` bigint, 
      `2` double, 
      `3` bigint, 
      `4` string, 
      `5` string
    )
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    WITH SERDEPROPERTIES ( 
      'parquet.column.index.access'='true') 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3://test-data/students_grades_dataset'
    TBLPROPERTIES (
      'classification'='parquet', 
      'compressionType'='none', 
      'projection.enabled'='false', 
      'typeOfData'='file')
    

    The columns can then be queried by number, optionally with aliases, which needn't conform to Athena's restrictions on column names, e.g.:

    SELECT "0" "NAME", "1" "age", "5" "name",
    "2" "34gggggggggggggggHgggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggg"
    FROM "test_database"."source_data_idx" LIMIT 10;
    

    This solution applies to both Parquet and ORC files, and is documented in the Athena user guide.
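
    Writing the indexed column list by hand is error-prone for wide schemas. As a minimal sketch, the DDL above can be generated from an ordered list of (name, type) pairs; the field list, table name, and S3 location below are made-up examples:

    ```python
    # Sketch: generate the indexed-column DDL from an ordered field list.
    # Field names/types, table name, and location are hypothetical examples.
    fields = [
        ("StudentName", "string"),
        ("age", "bigint"),
        ("grade_avg", "double"),
        ("credits", "bigint"),
        ("email", "string"),
        ("studentname", "string"),  # differs from "StudentName" only in case
    ]

    def indexed_ddl(table, location, fields):
        # Column names are just the ordinal positions, backtick-quoted.
        cols = ",\n".join(f"  `{i}` {typ}" for i, (_, typ) in enumerate(fields))
        return (
            f"CREATE EXTERNAL TABLE `{table}`(\n{cols}\n)\n"
            "ROW FORMAT SERDE\n"
            "  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'\n"
            "WITH SERDEPROPERTIES (\n"
            "  'parquet.column.index.access'='true')\n"
            "STORED AS INPUTFORMAT\n"
            "  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'\n"
            "OUTPUTFORMAT\n"
            "  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'\n"
            f"LOCATION\n  '{location}'"
        )

    print(indexed_ddl("source_data_idx", "s3://test-data/students_grades_dataset", fields))
    ```

    The ordering of `fields` must match the physical column order in the Parquet files, since index access is purely positional.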
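
    Likewise, the SELECT aliases can be generated so queries show the original field names without retyping them. This is a sketch with a hypothetical field list; double quotes inside an alias are escaped by doubling them, per standard SQL quoting:

    ```python
    # Sketch: build a SELECT that aliases positional columns back to the
    # original Parquet field names. The field names are made-up examples.
    fields = ["StudentName", "age", "grade_avg", "credits", "email", "studentname"]

    def aliased_select(database, table, fields, limit=10):
        # chr(34) is the double-quote character; doubling it escapes it in SQL.
        cols = ",\n  ".join(
            f'"{i}" "{name.replace(chr(34), chr(34) * 2)}"'
            for i, name in enumerate(fields)
        )
        return f'SELECT\n  {cols}\nFROM "{database}"."{table}" LIMIT {limit};'

    print(aliased_select("test_database", "source_data_idx", fields))
    ```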