Tags: sql, performance, optimization, snowflake-cloud-data-platform, query-optimization

Speeding Up SQL Join


I have a table A that contains order data with a sequence number for timing. I have another table B that contains separate order data, also with a sequence number. The sequence numbers are synchronized between the two tables, meaning a row in table A with a higher sequence number came after any row in table B with a lower sequence number.

My goal is, for each order in table A, to get the order from table B that came immediately before it. To accomplish this, I have also created a lead sequence number in table B like so:

SELECT 
    *
    , LEAD(sequence, 1, 999999999999) OVER (order by sequence) as lead_sequence
FROM 
    table_b

This is what I have so far on the join:

SELECT 
    *
FROM  
    table_a a  
INNER JOIN 
    table_b b
ON 
    a.sequence > b.sequence
AND 
    a.sequence < b.lead_sequence;

Logically, this works, but does not seem to be performant. Are there any clever ways to speed this up? I do not have any kind of contextual ID to link these two tables together aside from the sequence number and a separate timestamp.


Solution

  • Great question:

    So let's build a little code that generates some data, to allow repeated testing:

    with orders as (
        select 
            random() as r
            ,x
        from (
            values ('a'),('b')
        ) as t(x)
        ,table(generator(rowcount=>10))
    )
    select o.*,
        row_number() over (order by r) as rn
    from orders as o
    order by rn;
    

    gives something that sounds like what you are saying:

    (screenshot: a sample of the generated, randomly ordered a/b rows)

    So we can turn that into a table, with 100,000 rows of each type instead of 10, and then from there pull out the A table and the B table. Then we have two tables, where there are some A rows that come before the B rows.

    create or replace table source_rows as
    with orders as (
        select 
            random() as r
            ,x
        from (
            values ('a'),('b')
        ) as t(x)
        ,table(generator(rowcount=>100000))
    )
    select o.*,
        row_number() over (order by r) as rn
    from orders as o
    order by rn;
    
    create or replace table a_rows as  
    select x, rn
    from source_rows 
    where x = 'a';
    
    create or replace table b_rows as  
    select x, rn
    from source_rows 
    where x = 'b';
    
    select count(*) from source_rows;
    -- 200000
    select count(*) from a_rows;
    -- 100000
    select count(*) from b_rows;
    -- 100000
    

    So if we run your SQL as it stands:

    SELECT 
        *
    FROM a_rows a  
    JOIN (
        SELECT 
            *
            ,LEAD(rn, 1, 99999999999) OVER (order by rn) as lead_rn
        FROM b_rows
    ) b
        ON a.rn > b.rn
            AND a.rn < b.lead_rn;
    

    which on my 100K rows (in both tables) took 107 seconds.

    So that range join in the ON can be swapped for a BETWEEN, and I would often use a BETWEEN together with the strict comparisons, like:

        ON a.rn between b.rn AND b.lead_rn
           AND a.rn > b.rn
           AND a.rn < b.lead_rn
    

    to avoid the edge values, which your version excludes. But because you have stated both tables come from the same sequence, there will be no duplicate values, so we can just use the BETWEEN on its own, as the edge cases will not exist:

        ON a.rn between b.rn AND b.lead_rn
    

    Now for the same test data, the above took 109 seconds, which was unexpected to me, as I have seen BETWEEN perform better in the past.

    Check the results are the same:

    select * from table(result_scan('01b0d613-0004-dbe1-0001-d15f00051096'))
    minus 
    select * from table(result_scan('01b0d60d-0004-db2d-0001-d15f0005504a'));
    
    select * from table(result_scan('01b0d60d-0004-db2d-0001-d15f0005504a'))
    minus
    select * from table(result_scan('01b0d613-0004-dbe1-0001-d15f00051096'));
    

    gives no missing results, so that is nice to see.

    So let's remove the LEAD from the mix:

    create or replace table lead_b_rows as 
        SELECT 
            *
            ,LEAD(rn, 1, 99999999999) OVER (order by rn) as lead_rn
        FROM b_rows;
    
    SELECT 
        *
    FROM a_rows a  
    JOIN lead_b_rows b
        ON a.rn > b.rn
            AND a.rn < b.lead_rn;
    

    OK, this also took 109 seconds, which implies the LEAD was effectively free, and that 107 and 109 seconds are in the order of "the same time", which makes sense; we are looking for 10%+ improvements anyway.

    Now something that makes my brain tingle is that there are exactly 100K rows in the result. I would expect some streaks like ABBBAAB, where the B in the fourth position would bind to both of the A's that follow it. The first SQL I wrote to check this counted the A's, but now that I type this all out, I see I really want to count the doubly-bound B's:

    select rn_1, count(*) as c_b_rn
    from table(result_scan('01b0d613-0004-dbe1-0001-d15f00051096'))
    group by 1
    having c_b_rn > 1
    order by c_b_rn desc;
    

    Ah, that's better:

    (screenshot: B row numbers bound to more than one A row, with their counts)

    So in this made-up data, there are B records that are bound to many different A records, which I suspect is not your intended use case. At that point those extra matches might need to be stripped out, which would add more time cost, but it is also not 100% clear you want this behavior, so I will ignore this path for now (it comes up again in the over-matching section below).

    The bucketing path:

    For this data we know the largest gap is 15, so we can use 3 buckets of 10 (covering 0, 10, 20) and know that we will cover the span of the data. Given each A binds to the most recent B prior to it, we floor the A rn by 10, and do the same to the B row numbers, but also add offsets (0, 1 and 2 are used because they are added after dividing by 10, so they need to be in the divided value space; 0, 10 and 20 could be used if the addition was done prior to the division). Then we join on those buckets equi-join style, filter a.rn > b.rn to avoid "future" B's, and filter the multi-bucket matches per a.rn via a QUALIFY:

    with a_bucket as (
        SELECT 
            *
            ,floor(rn/10) as bucket
        FROM a_rows
        where rn < 100
    ), b_bucket as (
        SELECT 
            *
            ,floor(rn/10)+ t.d as bucket
        FROM b_rows
        CROSS JOIN (values (0),(1),(2)) as t(d)
        where rn < 100
    )--, bucketed as (
    SELECT 
        a.x as a_x
        ,a.rn as a_rn
        ,b.x as b_x
        ,b.rn as b_rn
    FROM a_bucket a  
    JOIN b_bucket b
        on a.bucket = b.bucket
            and a.rn > b.rn
    qualify row_number() over (partition by a_rn order by b_rn desc) = 1
    order by a_rn
    

    (screenshot: bucketed join results for the rn < 100 test slice)

    So we can see that a(2) binds to b(1), a(4,5,6) bind to b(3), and a(9) binds to b(8), implying there was also a b(7) that was skipped. So it seems to work nicely.

    We can remove those testing filters where rn < 100 and see how it works over the larger 100K dataset:
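
    The full run is just the same query with both of the where rn < 100 testing filters dropped:

    with a_bucket as (
        SELECT 
            *
            ,floor(rn/10) as bucket
        FROM a_rows
    ), b_bucket as (
        SELECT 
            *
            ,floor(rn/10)+ t.d as bucket
        FROM b_rows
        CROSS JOIN (values (0),(1),(2)) as t(d)
    )
    SELECT 
        a.x as a_x
        ,a.rn as a_rn
        ,b.x as b_x
        ,b.rn as b_rn
    FROM a_bucket a  
    JOIN b_bucket b
        on a.bucket = b.bucket
            and a.rn > b.rn
    qualify row_number() over (partition by a_rn order by b_rn desc) = 1
    order by a_rn;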

    (screenshot: the bucketed join run over the full 100K dataset)

    Well, I was not expecting 0.7 seconds; that is an improvement.

    We had better check the results are all the same, and they are...
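
    One way to do that check (a quick sketch) is to restate both joins as CTEs with matching column lists, and take the set difference in both directions:

    with range_join as (
        -- the original LEAD-based range join
        select a.rn as a_rn, b.rn as b_rn
        from a_rows a
        join lead_b_rows b
            on a.rn > b.rn
                and a.rn < b.lead_rn
    ), bucket_join as (
        -- the bucketed equi-join
        select a.rn as a_rn, b.rn as b_rn
        from (
            select *, floor(rn/10) as bucket
            from a_rows
        ) as a
        join (
            select *, floor(rn/10) + t.d as bucket
            from b_rows
            cross join (values (0),(1),(2)) as t(d)
        ) as b
            on a.bucket = b.bucket
                and a.rn > b.rn
        qualify row_number() over (partition by a.rn order by b.rn desc) = 1
    )
    select * from (
        select * from range_join
        minus
        select * from bucket_join
    ) as missing_from_bucket
    union all
    select * from (
        select * from bucket_join
        minus
        select * from range_join
    ) as extra_in_bucket;
    -- zero rows back means the two approaches agree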

    So if you can work out a "safe" margin that will always cover your bucket window, you can have some large savings. To this end, for my example code, I changed to values (0),(1),(2),(3),(4),(5),(6) and it still ran in 750ms. And changing the source size from 100K to 1M, the above code ran in 3.1s, versus your original join, which I stopped after 10 minutes.
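
    One way to work out such a margin is to measure the largest gap between consecutive B sequence values, something like:

    -- how far can an A row sit past the B row before it? that distance is
    -- bounded by the largest gap between consecutive B sequence values
    select max(rn - prev_rn) as largest_gap
    from (
        select rn, lag(rn) over (order by rn) as prev_rn
        from b_rows
    ) as gaps;

    and then pick a bucket width and enough offsets that they comfortably cover that gap.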

    Over matching:

    So in the bucketing code above, a(4), a(5) and a(6) all matched b(3). Now if we really want b(3) to only match a(4), and a(5) and a(6) to have no match, we can add another QUALIFY after the bucketing has completed:

    with a_bucket as (
        SELECT 
            *
            ,floor(rn/10) as bucket
        FROM a_rows
        where rn < 100
    ), b_bucket as (
        SELECT 
            *
            ,floor(rn/10)+ t.d as bucket
        FROM b_rows
        CROSS JOIN (values (0),(1),(2),(3),(4),(5),(6)) as t(d)
        where rn < 100
    ), bucketed as (
        SELECT 
            a.x as a_x
            ,a.rn as a_rn
            ,b.x as b_x
            ,b.rn as b_rn
        FROM a_bucket a  
        JOIN b_bucket b
            on a.bucket = b.bucket
                and a.rn > b.rn
        qualify row_number() over (partition by a_rn order by b_rn desc) = 1
    )
    select * from bucketed
    qualify row_number() over (partition by b_rn order by a_rn ) = 1
    order by a_rn
    

    With that last qualify excluded (this is new data, because I went to 1M rows; sigh, I knew I should have made new tables):

    (screenshot: 1M-row results without the final qualify)

    and with the filter in place:

    (screenshot: 1M-row results with the final qualify in place)

    So the profile of the 1M results is worth looking at also:

    (screenshot: query profile for the 1M-row run)

    Of the 3.7 seconds for this query, the join of the ~7M bucketed table B values accounts for only 9% of the time. The resulting 32M rows are then window-functioned by the ROW_NUMBER for 45% of the time, so you can see how the equi-join on pre-processed data is the path to speed, and how row-based logic like correlated sub-queries does not scale.