Search code examples
sqlexceljoinvlookupvenn-diagram

How to understand joins when your mental reference is Excel's vlookup?


When I was new to SQL, I did some real intense thinking to understand how SQL joins work. At a certain point I realized that years of prior experience with -or say, exposure to, MS Excel's vlookup, got my thinking and understanding in the wrong way. Though there are a lot of good references on the net, it just didn't get to me. The Venn diagrams also didn't help me out at first. I post this question so hopefully others won't fall into the same mental-trap as I did. I'll also give an answer. Other suggestions are highly welcome too of course. Though it costed me some time to write this up, I just couldn't bare with myself if I did not post this...


Solution

  • When I started with SQL and joins, I just couldn't understand why, for example, the SQL's left join gave me more results than I would expect when I would have done a vlookup. The Venn diagrams did not help me, though I find them very clarifying now.

    After a week of real intense intense thinking I suddenly realized that my prior experience had paradigmed me: in the basic formula of vlookup, only one result is given back. In the basic expression of a left join, zero or more results are given back.

    I'll try to clarify. For example I have the beneath data and formulas (in column D) in an Excel sheet. The column headers are in the first row, so that's why the formula states A2, etc. For the purpose of simplicity I have the data in column A near B and C, in reality it would be better to have it somewhere else, because you could interpret the dataset as a table where every row is a record. That is not the case in this example. In other words: cell A3 (2) is not of the same record as B3 to C3 (1, Blue). For those who read this article I assume they understand the formula and also the 'false'-clause, elsewise start here.

    A    B    C        D                                (result in D) 
    1    1    Green    =vlookup(A2,B:C;2;false)        (result=Green)
    2    1    Blue     =vlookup(A3,B:C;2;false)        (result=Orange)
    3    2    Orange   =vlookup(A4,B:C;2;false)        (result=Yellow)
    4    3    Yellow   =vlookup(A5,B:C;2;false)        (result=#n/a)
    

    In column D2 you see that only 'Green' is given as a result. A join would give back 'Green' AND ' Blue'.

    See beneath a SQL query (dialect is t-sql) and the result.

    select A as AFromTableA
    ,B as BFromTableBC
    ,C as CFromTableBC
    from tableBC left outer join tableA
    on B=A;
    

    Result:

    AFromTableA BFromTableBC    CFromTableBC
    1           1               Green
    1           1               Blue
    2           2               Orange
    3           3               Yellow
    

    You see that both Green and Blue are given back in the result. That is because join searches for all results. Also see that B=4 is not return, that is because it is a left join starting from TableBC. If it started from TableA, where 4 in matter of fact IS in the table, a null result would be given back. Like this:

    AFromTableA BFromTableBC    CFromTableBC
    1           1               Green
    1           1               Blue
    2           2               Orange
    3           3               Yellow
    4           NULL            NULL
    

    Here is some scripting to do it yourself:

    create table tableA (A int)
    create table tableBC (B int, C nvarchar(10))
    
    insert into tableA
    values (1),(2),(3),(4);
    
    insert into tableBC (B,C)
    values (1, 'Green')
    ,(1, 'Blue')
    ,(2,'Orange')
    ,(3, 'Yellow');
    

    So, hopefully this answer helps you in your thinking. Now read some more clarifying information about joins:

    Difference between INNER and OUTER joins

    A Visual Explanation of SQL Joins