Search code examples
parquetamazon-athenapresto

Can I use Athena / Presto to sort a table before writing?


I want to archive my logs into the Parquet format. Before writing the table, I want to sort it by a column c so that each Parquet file will only have a small range of c. That will allow Athena / Presto to efficiently scan the table when a query includes a WHERE clause on column c (via predicate pushdown).

However, it's unclear to me whether I can use Athena or Presto to sort the entire table. I need a distributed sort - not one that takes place on a single node - because the dataset is too big to fit on a single node. Is such a sort possible? If so, how to I invoke it?


Solution

  • Presto supports distributed sort since 0.206. Athena is currently based on Presto 0.172 and I don't know if they backported this feature.

    So your choices are