Search code examples
sqlhadoophivequery-optimizationhiveql

Guidance in re-writing query using NOT EXISTS


We have a query that runs for different sources, many times every day. Each time it runs for 40+ minutes and we are trying to see if we can reduce the query execution time

All the query does is tries to add new rows to the existing table ( by comparing on keys )

                    SELECT          A.*
                    FROM            
                    ( 
                                       SELECT *
                                       FROM   A 
                    )
                    A
                    LEFT OUTER JOIN 
                                    ( 
                                           SELECT f1, f11
                                           FROM   B 
                                           WHERE  f13 IN  (  SELECT f13 FROM   C) 
                                    ) 
                    B 
                    ON              A.f1 = B.f1
                    AND             nvl(A.f11,'NULL') = nvl(B.f11,'NULL')
                    WHERE           isnull(B.f1) 
                    AND             isnotnull(A.f1) 

What I am trying to do


                    SELECT          A.* .
                    FROM            ( 
                                           SELECT * 
                                           FROM   A 
                                   ) 
                    A
                    WHERE NOT EXISTS  
                                    ( 
                                           SELECT 1
                                           FROM   B  WHERE  f13 IN  (  SELECT f13 FROM   C)
                                           and A.f1 = B.f1 and A.f11 = B.f11

                                    ) 

Will this work to not only generate similar results but also help in reducing the execution time. A small improvement in query will go a long way as this query runs for 8 times every day.

Any suggestions will be helpful.


Solution

  • In Hive NOT EXISTS will generate the same plan as left join + isnull filter.

    You can use UNION + row_number() for incremental updates like in this answer https://stackoverflow.com/a/44755825/2700344, without join, it may work faster