I am reading Effective Mysql - Optimizing Mysql Statements
and in chapter 3
there was this explanation:
The secondary indexes in InnoDB use the B-tree data structure; however, they differ from the MyISAM implementation. In InnoDB, the secondary index stores the physical value of the primary key. In MyISAM, the secondary index stores a pointer to the data that contains the primary key value.
This is important for two reasons. First, the size of secondary indexes in InnoDB can be much larger when a large primary key is defined—for example when your primary key in InnoDB is 40 bytes in length. As the number of secondary indexes increase, the comparison size of the indexes can become significant. The second difference is that the secondary index now includes the primary key value and is not required as part of the index. This can be a significant performance improvement with table joins and covering indexes.
There are many questions that come to my mind, mostly due to lack of understanding of what author is trying to convey.
It is unclear what the author means in the second difference in second paragraph. What is not required as part of index anymore?
Does InnoDB secondary index B-tree only store PK value or PK value and Pointer to it? or PK Value and pointer to data row?
What kind of performance improvement would there be due to the storage method (2nd question's answer)?
This question contains an example and also an answer. He explains how it contains PK value, but what I am still not understanding is,
To complete the join, if the pointer is not there in the secondary index and only the value, wont MySQL do a full index scan on Primary Key index with that value from secondary index? How would that be efficient than having the pointer also?
The secondary index is an indirect way to access the data. Unlike the primary (clustered) index, when you traverse the secondary index in InnoDB and you reach the leaf node you find a primary key value for the corresponding row the query is looking for. Using this value you traverse the primary index to fetch the row. This means 2 index look ups in InnoDB.
For MyISAM because the leaf of the secondary node is a pointer to the actual row you only require 1 index lookup.
The secondary index is formed based on certain attributes of your table that are not the PK. Hence the PK is not required to be part of the index by definition. Whether it is (InnoDB) or not (MyISAM) is implementation detail with corresponding performance implications.
Now the approach that InnoDB follows might at first seem inefficient in comparison to MyISAM (2 lookups vs 1 lookup) but it is not because the primary index is kept in memory so the penalty is low.
But the advantage is that InnoDB can split and move rows to optimize the table layout on inserts/updates/deletes of rows without needing to do any updates on the secondary index since it does not refer to the affected rows directly