Search code examples
amazon-web-servicesaws-cloudformationamazon-athena

Add partition projection to AWS Athena table using Cloudformation


I have an Athena table defined with a template specified like so in cloudformation:

Cloudformation Create

EventsTable:
  Type: AWS::Glue::Table
  Properties:
    CatalogId: !Ref AWS::AccountId
    DatabaseName: !Ref DatabaseName
    TableInput:
      Description: "My Table"
      Name: !Ref TableName
      TableType: EXTERNAL_TABLE
      StorageDescriptor:
        Compressed: True
        Columns:
          - Name: account_id
            Type: string
            Comment: "Account Id of the account making the request"
            ...
        InputFormat: org.apache.hadoop.mapred.TextInputFormat
        SerdeInfo:
          SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
        OutputFormat: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
        Location: !Sub "s3://${EventsBucketName}/events/"


This works well and deploys. I also found out I can have partition projections created as per this doc and this doc

And can make that work with a direct table creation, roughly:

SQL Create

CREATE EXTERNAL TABLE `performance_data.events`
(
  `account_id`  string,
...
)
   PARTITIONED BY (
     `day` string)
    ROW FORMAT SERDE
        'org.openx.data.jsonserde.JsonSerDe'
    STORED AS INPUTFORMAT
        'org.apache.hadoop.mapred.TextInputFormat'
        OUTPUTFORMAT
          'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    LOCATION
        's3://my-bucket/events/'
    TBLPROPERTIES (
        'has_encrypted_data' = 'false',
        'projection.enabled' = 'true',
        'projection.day.type' = 'date',
        'projection.day.format' = 'yyyy/MM/dd',
        'projection.day.range' = '2020/01/01,NOW',
        'projection.day.interval' = '1',
        'projection.day.interval.unit' = 'DAYS',
        'storage.location.template' = 's3://my-bucket/events/${day}/'
)

But I can't find the docs to convert into the cloud formation structure. So my question is, how can I achieve the partition projection shown in the SQL code in cloudformation?


Solution

  • I now have a working solution. The missing piece was really a missing parameter, here is the solution:

    
    MyTableResource:
      Type: AWS::Glue::Table
      Properties:
        CatalogId: MyAccountId
        DatabaseName: MyDatabase
        TableInput:
          Description: "My Table"
          Name: mytable
          TableType: EXTERNAL_TABLE
          PartitionKeys:
            - Name: day
              Type: string
              Comment: Day partition
          Parameters:
            "projection.enabled": "true"
            "projection.day.type": "date"
            "projection.day.format": "yyyy/MM/dd"
            "projection.day.range": "2020/01/01,NOW"
            "projection.day.interval": "1"
            "projection.day.interval.unit": "DAYS"
            "storage.location.template":  "s3://my-bucket/events/${day}/"
    
    
          StorageDescriptor:
            Compressed: True
            Columns:
              ...
    
            InputFormat: org.apache.hadoop.mapred.TextInputFormat
            SerdeInfo:
              Parameters:
                serialization.format: '1'
              SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
            OutputFormat: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
            Location: "s3://my-bucket/events/"
    

    The key addition was:

    serialization.format: '1'
    

    This now completely works and one can do a query that using the partition as:

    
    select * from mytable where day > '2022/05/03'