Search code examples
sqletldatabase-performancequery-performancevertica

INTERSECT two table of size 500ml rows in vertica


I am very new to vertica db and hence looking for different efficient ways for comparing two tables of average size 500ml-800ml rows in vertica. I have a process that gets the data from vertica view and dump in to SQL server for later merge to final table in sql server. for few large tables combine it is dumping about 3bl rows daily. Instead of dumping all data I want to take daily snapshot, and compare it with previous days snapshot on vertica side only and then push changed rows only in to SQL SEREVER.

lets say previous snapshot is stored in tableA, today's snapshot stored in tableB. PK on both table is column named OrderId.

Simplest way I can think of is

Select * from tableB
Where OrderId NOT IN (

SELECT * from tableA
INTERSECT
SELECT * from tbleB
)

So my questions are:

  1. Is there any other/better option in vertica to get only changed rows between two tables? Or should I even consider doing this compare on vertica side?
  2. How much doing such comparison should take?
  3. What should I consider to improve the performance of such query?

Solution

  • If your columns have no NULL values, then a massive LEFT JOIN would seem to do what you want:

    select b.*
    from tableB b left join
         tableA a
         on b.OrderId = a.OrderId and
            b.col1 = a.col1 and
            . . .   -- for all the columns you care about
    

    However, I think you want except:

    select b.*
    from tableB b
    except
    select a.*
    from tableA a;
    

    I imagine this would have reasonable performance.