
How to get the latest data with AWS Glue


I manage some data in AWS, and there are some Parquet files in an S3 bucket. Every day, new files are added to this bucket, and I would like to query the data in the latest file using Athena.

I want to know how to specify the path of the latest file in an Athena query. Is it possible to identify the latest file from the path of each Parquet file?


Solution

  • Athena is based on the Presto engine (now Trino). Support for querying the file timestamp was added to the engine recently, but it will likely take a while (probably years) before it is available in Athena.

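    In the meantime, if your Parquet file names include a timestamp, you could do something like the query below. It relies on the timestamp sorting lexicographically (for example, YYYY-MM-DD), because "$path" is compared as a string: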

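    -- "$path" is the hidden column holding each row's source file path
    select * from mydb
    where "$path" in
    (
       -- the path that sorts last is taken to be the latest file
       select "$path"
       from mydb
       order by "$path" desc
       limit 1
    )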
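
If the file names do not carry a sortable timestamp, one possible workaround is to find the newest object outside Athena and pass its path into the query. The following is only a rough sketch under several assumptions: it uses boto3's `list_objects_v2` paginator to read each object's `LastModified` value, and the helper names (`latest_object_key`, `build_latest_file_query`, `list_objects`), the `.parquet` suffix filter, and the bucket, prefix, and table names are all hypothetical placeholders rather than anything from the original setup.

```python
def latest_object_key(objects):
    """Return the key of the most recently modified Parquet object, or None.

    `objects` is a list of dicts shaped like the "Contents" entries returned
    by S3 ListObjectsV2 (each has at least "Key" and "LastModified").
    """
    # Assumption: data files end in ".parquet"; marker files are skipped.
    parquet = [o for o in objects if o["Key"].endswith(".parquet")]
    if not parquet:
        return None
    return max(parquet, key=lambda o: o["LastModified"])["Key"]


def build_latest_file_query(table, bucket, key):
    """Build an Athena query restricted to a single file via "$path"."""
    # Double any single quotes so the path is a valid SQL string literal.
    path = f"s3://{bucket}/{key}".replace("'", "''")
    return f"""select * from {table} where "$path" = '{path}'"""


def list_objects(bucket, prefix):
    """List every object under a prefix (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    objects = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no matching objects.
        objects.extend(page.get("Contents", []))
    return objects
```

For example, `build_latest_file_query("mydb", "my-bucket", latest_object_key(list_objects("my-bucket", "data/")))` would produce a query string that could then be submitted to Athena. Filtering on an exact "$path" value avoids depending on how the file names sort, at the cost of one extra S3 listing before each query.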