I have an Athena table defined with a template specified like so in cloudformation:
Cloudformation Create
EventsTable:
Type: AWS::Glue::Table
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseName: !Ref DatabaseName
TableInput:
Description: "My Table"
Name: !Ref TableName
TableType: EXTERNAL_TABLE
StorageDescriptor:
Compressed: True
Columns:
- Name: account_id
Type: string
Comment: "Account Id of the account making the request"
...
InputFormat: org.apache.hadoop.mapred.TextInputFormat
SerdeInfo:
SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
OutputFormat: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
Location: !Sub "s3://${EventsBucketName}/events/"
This works well and deploys. I also found out I can have partition projections created as per this doc and this doc
And can make that work with a direct table creation, roughly:
SQL Create
CREATE EXTERNAL TABLE `performance_data.events`
(
`account_id` string,
...
)
PARTITIONED BY (
`day` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://my-bucket/events/'
TBLPROPERTIES (
'has_encrypted_data' = 'false',
'projection.enabled' = 'true',
'projection.day.type' = 'date',
'projection.day.format' = 'yyyy/MM/dd',
'projection.day.range' = '2020/01/01,NOW',
'projection.day.interval' = '1',
'projection.day.interval.unit' = 'DAYS',
'storage.location.template' = 's3://my-bucket/events/${day}/'
)
But I can't find the docs to convert into the cloud formation structure. So my question is, how can I achieve the partition projection shown in the SQL code in cloudformation?
I now have a working solution. The missing piece was really a missing parameter, here is the solution:
MyTableResource:
Type: AWS::Glue::Table
Properties:
CatalogId: MyAccountId
DatabaseName: MyDatabase
TableInput:
Description: "My Table"
Name: mytable
TableType: EXTERNAL_TABLE
PartitionKeys:
- Name: day
Type: string
Comment: Day partition
Parameters:
"projection.enabled": "true"
"projection.day.type": "date"
"projection.day.format": "yyyy/MM/dd"
"projection.day.range": "2020/01/01,NOW"
"projection.day.interval": "1"
"projection.day.interval.unit": "DAYS"
"storage.location.template": "s3://my-bucket/events/${day}/"
StorageDescriptor:
Compressed: True
Columns:
...
InputFormat: org.apache.hadoop.mapred.TextInputFormat
SerdeInfo:
Parameters:
serialization.format: '1'
SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
OutputFormat: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
Location: "s3://my-bucket/events/"
The key addition was:
serialization.format: '1'
This now completely works and one can do a query that using the partition as:
select * from mytable where day > '2022/05/03'