Search code examples
sql-serverindexinghashbinary-search-tree

Why is hash index slower using "Less than" in SQL


I've finished my first semester in a college-level SQL course where we used "SQL queries for Mere Mortals" 3rd edition.

Long term I want to work in data governance or as a data scientist, so digging deeper is needed and I found the Stanford SQL course. Today taking the first mini quiz, I got the answers right but on these two I'm not understanding WHY I got the answers right.

My 'SQL for Mere Mortals' book doesn't even cover hash or tree-based indexes so I've been searching online for them.

I mostly guessed based on what she said but it feels more like luck than "I solidly understand why". So I've ordered "Introduction to Algorithms" 3rd edition by Thomas Cormen and it arrived last week but it will take me a while to read through all 1,229 pages.

Found that book in this other stackoverflow link =>https://stackoverflow.com/questions/66515417/why-is-hash-function-fast

Stanford Course => https://www.edx.org/course/databases-5-sql

I thought a hash index on College.enrollment would not speed up because they limit it to less than a number vs an actual number ?? I'm guessing per this link Better to use "less than equal" or "in" in sql query that the query would be faster if we used "<=" rather than "<" ? enter image description here

This one was just a process of elimination as it mentions the first item after the WHERE clause, but then was confusing as it mentions the last part of Apply.cName = College.cName. enter image description here

My questions:

  1. I'm guessing that similar to algebra having numerators and denominators, quotients, and many other terms that specifically describe part of an equation using technical terms. How would you use technical terms to describe why these answers are correct.

  2. On the second question, why is the first part of the second line referenced and the last part of the same line referenced as the answers. Why didn't they pick the first part of each of the last part of each?

For context, most of my SQL queries are written for PostgreSQL now within PyCharm on python but I do a lot of practice using the PgAgmin4 or MySqlWorkbench desktop platforms.

I welcome any recommendations you have on paper books or pdf's that have step-by-step tutorials as many, many websites have holes or reference technical details that are confusing.

Thanks


Solution

  • 1. A hash index is only useful for equality matches, whereas a tree index can be used for inequality (< or >= etc).

    With this in mind, College.enrollment < 5000 cannot use a hash index, as it is an inequality. All other options are exact equality matches.

    This is why most RDBMSs only let you create tree-based indexes.


    2. This one is pretty much up in the air.

    "the first item after the WHERE clause" is not relevant. Most RDBMSs will reorder the joins and filters as they see fit in order to match indexes and table statistics.

    I note that the query as given is poorly written. It should use proper JOIN syntax, which is much clearer, and has been in use for 30 years already.

    SELECT *   -- you should really specify exact columns
    FROM Student AS s    -- use aliases
    JOIN [Apply] AS a ON a.sID = s.sID   -- Apply is a reserved keyword in many RDBMS
    JOIN College AS c ON c.cName = a.aName
    WHERE s.GPA > 1.5 AND c.cName < 'Cornell';
    

    Now it's hard to say what a compiler would do here. A lot depends on the cardinalities (size of tables) in absolute terms and relative to each other, as well as the data skew in s.GPA and c.cName.

    It also depends on whether secondary key (or indeed INCLUDE) columns are added, this is clearly not being considered.

    Given the options for indexes you have above, and no other indexes (not realistic obviously), we could guesstimate:

    • Student.sID, College.cName
      This may result in an efficient backwards scan on College starting from 'Cornell', but Apply would need to be joined with a hash or a naive nested loop (scanning the index each time).
      The index on Student would mean an efficient nested loop with an index seek.
    • Student.sID, Student.GPA Is this one index or two? If it's two separate indexes, the second will be used, and the first is obviously going to be useless. Apply and College will still need heavy joins.
    • Apply.cName, College.cName
      This would probably get you a merge-join on those two columns, but Student would need a big join.
    • Apply.sID, Student.GPA
      Student could be efficiently scanned from 1.5, and Apply could be seeked, but College requires a big join.

    Of these options, the first or the last is probably better, but it's very hard to say without further info.

    In a real system, I would have indexes on all tables, and use INCLUDE columns wisely in order to avoid key-lookups. You would want to try to get a better feel for which tables are the ones that need to be filtered early etc.