Search code examples
sqlvertica

Joins, conditions and speed in SQL


While preparing some requests, I was writing this :

SELECT *
FROM ta A
JOIN tb B
    ON A.col1 = B.col1 
JOIN tc C
    ON B.col2 = C.col2 
WHERE B.col3 = 'whatever'
AND C.col4 = 'whatever2'

And I began to think about the following :

SELECT *
FROM ta A
JOIN (SELECT * FROM tb WHERE col3 = 'whatever') B
    ON A.col1 = B.col1
JOIN (SELECT * FROM tc WHERE col4 = 'whatever2') C
    ON B.col2 = C.col2

(If I'm not mistaken, the result would be the same). I'm wondering if it would be significantly faster ? My guess is that it would but I'd be interested in knowing why/why not ?

(Because our server is down at the moment, I can't test it myself right now, so I'm asking here, I hope you won't mind.)

(In case it matters, the engine is Vertica, but my question isn't really specific to Vertica)


Solution

  • Your second query is a little off, it should be:

    SELECT *
    FROM ta A
    JOIN (SELECT * FROM tb WHERE tb.col3 = 'whatever') B
        ON A.col1 = B.col1
    JOIN (SELECT * FROM tc WHERE tc.col4 = 'whatever2') C
        ON B.col2 = C.col2
    

    Notice the inline view where clauses need to reference the table in scope, not the alias for the view. B and C are out of scope within the inline views.

    In any case, because you are doing an inner join, it won't matter from a results perspective because the condition is the same whether it occurs pre-join or post-join.

    You can reasonably rely on the optimizer to do the following:

    1. Only materialize the columns required when needed.
    2. Push predicates down where it makes sense

    That said, there should be no difference between the two statements. Most likely it is pushing down predicates for the first one to make it more like the second one. If you have statistics gathered, the optimizer should be smart enough to query these the same way (or really close).

    That isn't to say I haven't seen what you have in your second query "fix" query issues for me in Vertica... but usually it's only when I am using multiple COUNT(DISTINCT ...) expressions or theta joins, etc.

    Now if this were an outer join, then the statements would be different. The first one would apply the filter after the join, the second would be before the join.

    Of course, I'll mention that you really just need to do an explain of both methods. Just make sure statistics are gathered.

    Hope it helps.