Search code examples
amazon-web-serviceskey-valueparquetcreate-tableamazon-athena

How to load key-value pairs (MAP) into Athena from Parquet file?


I have an S3 bucket full of .gz.parquet files. I want to make them accessible in Athena. In order to do this I am creating a table in Athena that points at the s3 bucket:

CREATE EXTERNAL TABLE user_db.table (
pan_id bigint,
dev_id bigint,
parameters ?????, 
start_time_local bigint
                                     )
STORED AS PARQUET
LOCATION ‘s3://bucket/path/to/folder/containing_files/’
tblproperties (“parquet.compression”=“GZIP”)
;

How do I correctly specify the data type for the parameters column?

Using # parquet-tools schema, I see the following schema of the data files:

  optional int64 pan_id;
  optional int64 dev_id;
  optional group parameters (MAP) {
    repeated group key_value {
      required binary key (UTF8);
      optional binary value (UTF8);
    }
  }
  optional int96 start_time_local;

Using # parquet-tools head, I see the following value for one row of data:

pan_id = 1668490
dev_id = 6843371
parameters:
.key_value:
..key = doc_id
..value = c2bd3593d7015fb912d4de229a302379babcf6a00a203fcf
.key_value:
..key = variables
..value = {“video_id”:“2313675068886132",“surface”:“post”}
start_time_local = QFOHvvYvAAAzhCUA

I appreciate any help you can give. I have not been able to find good documentation for the MAP datatype being used in CREATE TABLE.


Solution

  • Maps are declared as map<string,string> (for string-to-string maps, other types are also possible), in your case the whole table DDL would be:

    CREATE EXTERNAL TABLE user_db.table (
      pan_id bigint,
      dev_id bigint,
      parameters map<string,string>, 
      start_time_local bigint
    )
    STORED AS PARQUET
    LOCATION 's3://bucket/path/to/folder/containing_files/'
    tblproperties ("parquet.compression" = "GZIP")
    

    The map type is the second to last in the list in the list of Athena data types