I want to archive my logs into the Parquet format. Before writing the table, I want to sort it by a column c
so that each Parquet file will only have a small range of c
. That will allow Athena / Presto to efficiently scan the table when a query includes a WHERE clause on column c
(via predicate pushdown).
However, it's unclear to me whether I can use Athena or Presto to sort the entire table. I need a distributed sort - not one that takes place on a single node - because the dataset is too big to fit on a single node. Is such a sort possible? If so, how to I invoke it?
Presto supports distributed sort since 0.206. Athena is currently based on Presto 0.172 and I don't know if they backported this feature.
So your choices are