Search code examples
sqlamazon-athenaprestotrino

Does Presto cache intermediate results internally out of the box?


Presto has multi connectors. While the connectors do implement read and write operations, from all the tutorials I read, it seems they are typically used as data sources to read from only. For example, netflix has "10 petabyte" of data on Amazon S3 and they explicitly state that no disk (and no HDFS) is used on the Presto worker nodes. The stated use case are "ad hoc interactive" queries.

Also, Amazon Athena is essentially S3+Presto and comes with similar use cases.

I'm puzzled how this can work in practice. Obviously, you don't want to read 10 PB of data on every query. So I assume, you want to keep some previously fetched data in memory, such as a database index. However, with no constraints on the data and the queries, I fail to understand how this can be efficient.

Use case 1: I run the same query frequently, e.g. to show metric on a dashboard. Does Presto avoid rescanning the data points which are already 'known'?

Use case 2: I'm analysing a large data set. Each query is slightly different, however there are common subqueries or we filter to a common subset of the data. Does Presto learn from previous queries and carry over intermediate results?

Or, if this is not the case, would I be well advised to store intermediate results somewhere (e.g. CREATE TABLE AS ...)?


Solution

  • As far as I know there is no intermediate implicit caching layer. When you use HDFS on your cluster, you surely benefit from OS disk caches, so next query run will be faster, but you won't get instant cached results. Similar data block-level caching might apply to S3 too.

    Generally, no reasonably-sized system can sift through 10 petabytes of data, since reading all that data would take a lot of time. However, data can be partitioned so that Presto knows more or less which pieces of data need to be scanned. When partitioning aligns with query conditions (e.g. you partition data by data and you query for most recent data), this can work really well.

    When your data is not partitioned the same way you query, and you don't want to re-partition it differently, saving temporary results with create table ... as select makes much sense. You can also store such temporary tables using some in-memory storage, e.g. raptor (currently undocumented) or memory connectors for even faster access.

    For some starter tips about partitioning, tuning storage and queries you can have a look at https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/.