Search code examples
mysqlsqlindexingtime-complexityclustered-index

Performance of clustered index on SQL Query


Suppose you have two tables:

Student(id, class) // 100 rows
Course(id, course) // 100 rows

Initially assume that there is no index on both the tables. Now suppose we have a query:-

select id, course 
from Student join course 
on student.id = Course.id and student.id = 20

Since you don't have any index, so you need to traverse all the rows in both tables.

Time complexity - O(100 x 100)

Now we updated the table and Student.id is a primary key. Clustered index will be created on it and now the overall complexity is

Time complexity - O(log 100) // Nested loop join

Do you think my assumption is correct? Could anyone help me?

Nested loop join algo is here:

enter image description here


Solution

  • join course 
    on student.id = Course.id
    

    is correctly in O(MN) (Worst-Case), where M and N are the number of rows in the first and second table respectively, since that is an equi-join (Join on a = condition) it compares every row from the first with the second.

    However, you also have a second condition. Since SQL has many performance-enhancing algorithms, it is very likely that the student.id = 20 would be evaluated first. Then you would have firstly M (assume linear in the number of rows of the students table) to search for student.id = 20. Then, if student.id = 20 is only constant, let's say m, you would have m * N.

    All in all, O(M + (m * N)).

    That here depends now on m. If m is constant then in the asymptotic analysis O(M + N) = O(2M), since M=N and you end up with O(M) = O(N) or linear. Otherwise, if m is in Omega(1), then it would be O(M + M * N) or as you assumed O(MN).

    Then with regards to the PRIMARY KEY a clustered index will/can be created. The time complexity now for future queries would be as you said O(log K), where K are the rows in the new table (can be != 100).

    Now why log K? Because relational databases structure indices as B-trees. Then in WC you get O(log K) in the height of the tree.

    More precisely

    since on B-trees you have max. 2d children and number of keys s between d - 1 < s < 2d. d is called order, degree or branching factor of the tree.

    Hope it helps!