Search code examples
sqlamazon-web-servicesamazon-athenapresto

Athena Query - "NOT IN" slow performance


Context

I am building a query using Athena and I noted that using "NOT IN" was having a significant impact on the runtime of the query. The query is performing a join between two tables and aims to filter the rows of the result by excluding a set of ids.

The id is NOT unique. It is possible it will be present across multiple rows. Essentially I wrote the subquery as I want to ignore ANY/ALL of the rows with a given id value if the rows meet the condition in the subquery.

In other words, for a given id "24324" which has 5 rows in the database, we want to filter out ALL of the rows if ANY of the individual rows values meet the conditions in the subquery.

My question

Is there a more efficient way to structure this query that I have not considered to reduce the runtime?

The query

Table and column names have been sanitised with placeholder/generic values.

SELECT *
FROM table_a
INNER JOIN table_b ON table_b.external_id = table_a.external_id 
                   AND table_b.client_name = table_a.client
-- These are the ids we do not want in the final result.
WHERE table_b.id NOT IN (SELECT distinct(table_b.id)
                         FROM table_b
                         WHERE table_b.decision IS NOT NULL 
                            OR table_b.submission_time IS NOT NULL)

Solution

  • filter the null records during the join condition. It will bring lesser records to join condition in first place. This will help in performance improvement, try to optimize the join. Bigger table on the left side, joining smaller table on the right will help improving performance.

    SELECT *
    FROM table_a
    INNER JOIN table_b ON table_b.external_id = table_a.external_id 
                       AND table_b.client_name = table_a.client
    WHERE NOT EXISTS (SELECT 1
                     FROM table_bb
                     WHERE (table_bb.decision IS NOT NULL 
                        OR table_bb.submission_time IS NOT NULL)
                     and table_b.id = table_bb.id)