As InnoDB organizes its data in B+ trees. The height of the tree affects the count of IO times which may be one of the main reasons that DB slows down.
So my question is how to predicate or calculate the height of the B+ tree (e.g. based on the count of pages which can be calculated by row size, page size, and row number), and thus to make a decision whether or not to partition the data to different masters.
Simpler computation
The quick and dirty answer is log base 100, rounded up. That is, each node in the BTree has about 100 leaf nodes. In some circles, this is called fanout.
1K rows: 2 levels 1M rows: 3 levels billion: 5 levels trillion: 6 levels
These numbers work for "average" rows or indexes. Of course, you could have extremes of about 2 or 1000 for the fanout.
Exact depth
You can find the actual depth from some information_schema:
For Oracle's MySQL:
$where = "WHERE ( ( database_name = ? AND table_name = ? )
OR ( database_name = LOWER(?) AND table_name = LOWER(?) ) )";
$sql = "SELECT last_update,
n_rows,
'Data & PK' AS 'Type',
clustered_index_size * 16384 AS Bytes,
ROUND(clustered_index_size * 16384 / n_rows) AS 'Bytes/row',
clustered_index_size AS Pages,
ROUND(n_rows / clustered_index_size) AS 'Rows/page'
FROM mysql.innodb_table_stats
$where
UNION
SELECT last_update,
n_rows,
'Secondary Indexes' AS 'BTrees',
sum_of_other_index_sizes * 16384 AS Bytes,
ROUND(sum_of_other_index_sizes * 16384 / n_rows) AS 'Bytes/row',
sum_of_other_index_sizes AS Pages,
ROUND(n_rows / sum_of_other_index_sizes) AS 'Rows/page'
FROM mysql.innodb_table_stats
$where
AND sum_of_other_index_sizes > 0
";
For Percona:
/* to enable stats:
percona < 5.5: set global userstat_running = 1;
5.5: set global userstat = 1; */
$sql = "SELECT i.INDEX_NAME as Index_Name,
IF(ROWS_READ IS NULL, 'Unused',
IF(ROWS_READ > 2e9, 'Overflow', ROWS_READ)) as Rows_Read
FROM (
SELECT DISTINCT TABLE_SCHEMA, TABLE_NAME, INDEX_NAME
FROM information_schema.STATISTICS
) i
LEFT JOIN information_schema.INDEX_STATISTICS s
ON i.TABLE_SCHEMA = s.TABLE_SCHEMA
AND i.TABLE_NAME = s.TABLE_NAME
AND i.INDEX_NAME = s.INDEX_NAME
WHERE i.TABLE_SCHEMA = ?
AND i.TABLE_NAME = ?
ORDER BY IF(i.INDEX_NAME = 'PRIMARY', 0, 1), i.INDEX_NAME";
(Those give more than just the depth.)
PRIMARY
refers to the data's BTree. Names like "n_diff_pfx03" refers to the 3rd level of the BTree; the largest such number for a table indicates the total depth.
Row width
As for estimating the width of a row, see Bill's answer. Here's another approach:
Non-leaf nodes, plus index leaf nodes, are trickier because you need to understand exactly what represents a "row" in such nodes.
(Hence, my simplistic "100 rows per node".)
But who cares?
Here's another simplification that seems to work quite well. Since disk hits are the biggest performance item in queries, you need to "count the disk hits" as the first order of judging the performance of a query.
But look at the caching of blocks in the buffer_pool. A parent node is 100 times as likely to be recently touched as the child node.
So, the simplification is to "assume" that all non-leaf nodes are cached and all leaf nodes need to be fetched from disk. Hence the depth is not nearly as important as how many leaf node blocks are touched. This shoots down your "35% speedup" -- Sure 35% speedup for CPU, but virtually no speedup for I/O. And I/O is the important component.
Note that if you fetching the latest 20 rows of a table that is chronologically stored, they will be found in the last 1 (or maybe 2) blocks. If they are stored by a UUID, it is more likely to tale 20 blocks -- many more disk hits, hence much slower.
Secondary Indexes
The PRIMARY KEY
is clustered with the data. That implies that a look by the PK needs to drill down one BTree. But a secondary index is implemented by a second BTree -- drill down it to find the PK, then drill down via the PK. When "counting the disk hits", you need to consider both BTrees. And consider the randomness (eg, for UUIDs) or not (date-ordered).
Writes
Step 1 may involve a read I/O; step 5 may involve a write I/O -- but you are not waiting for it to finish.
Index updates
UNIQUE
indexes must be checked before finishing an INSERT
. This involves a potentially-cached read I/O.
For a non-unique index, an entry in the "Change buffer" is made. (This lives in the buffer_pool.) Eventually that is merged with the appropriate block on disk. That is, no waiting for I/O when INSERTing
a row (at least not waiting to update non-unique indexes).
Corollary: UNIQUE
indexes are more costly. But is there really any need for more than 2 such indexes (including the PK)?