Search code examples
prestoamazon-athenatrino

Athena sub-query and LEFT JOIN data scanned optimization


There is a table with parquet data format of 20 GB and simple query will give results by scanning only 1GB of data. select columns from table1 where id in (id1, id2, idn)

If same query is executed with a sub-query such as - select columns from table1 where id in (select id from table2 limit n) This query will give results by scanning 20GB, whole the table.Even n is very small number as 10, 50 or 5000.

Same happen with LEFT JOIN.

SELECT table1.* FROM
table2 LEFT JOIN table1
ON table2.id=table1.id 

Is there a way to achieve this by running single query instead of fetch and save result of sub-query and pass as args into another query? Any best practices of How currently users runs LEFT JOIN or sub-query without full table scan on Athena ?

Similar questions- Question -1, Question -2


Solution

  • Is there a way to achieve this by running single query instead of fetch and save result of sub-query and pass as args into another query?

    This is most commonly covered by "Dynamic filtering". Currently there is no way to do this.

    Athena is based on Presto and Presto doesn't support dynamic filtering yet, but will likely support it in the next release (Presto 321). You can track the issue here: https://github.com/prestosql/presto/issues/52

    Athena is based on Presto 0.172 currently, so it still needs to upgrade.