Tags: innodb, b-tree

How to predict the I/O count of a MySQL query?


InnoDB organizes its data in B+ trees, and the height of the tree affects the number of I/O operations a query needs, which may be one of the main reasons a database slows down.

So my question is: how can I predict or calculate the height of the B+ tree (e.g. based on the count of pages, which can be calculated from row size, page size, and row count), and thus decide whether or not to partition the data across different masters?


Solution

  • Simpler computation

    The quick and dirty answer is log base 100, rounded up. That is, each node in the BTree has about 100 children. In some circles, this is called the fanout.

    1K rows: 2 levels
    1M rows: 3 levels
    billion: 5 levels
    trillion: 6 levels

    These numbers work for "average" rows or indexes. Of course, you could have extremes of about 2 or 1000 for the fanout.
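    The log-base-fanout rule above can be sketched in Python (a hypothetical helper; the fanout of 100 is just the assumed average from above):

```python
import math

def btree_depth(row_count, fanout=100):
    """Estimate B+ tree depth as ceil(log base `fanout` of row_count).
    The round() guards against floating-point error on exact powers."""
    if row_count <= 1:
        return 1
    return math.ceil(round(math.log(row_count, fanout), 9))

for rows in (1_000, 1_000_000, 10**9, 10**12):
    print(f"{rows:>15,} rows: {btree_depth(rows)} levels")
```

    With a fanout near the extremes (2 or 1000), just change the `fanout` argument and the depth shifts accordingly.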

    Exact depth

    You can find the actual depth from the statistics tables (mysql.* and information_schema.*):

    For Oracle's MySQL:

    $where = "WHERE ( ( database_name = ? AND table_name = ? )
              OR      ( database_name = LOWER(?) AND table_name = LOWER(?) )  )";
    $sql = "SELECT  last_update,
                    n_rows,
                    'Data & PK' AS 'Type',
                    clustered_index_size * 16384 AS Bytes,
                    ROUND(clustered_index_size * 16384 / n_rows) AS 'Bytes/row',
                    clustered_index_size AS Pages,
                    ROUND(n_rows / clustered_index_size) AS 'Rows/page'
            FROM mysql.innodb_table_stats
            $where
        UNION
            SELECT  last_update,
                    n_rows,
                    'Secondary Indexes' AS 'BTrees',
                    sum_of_other_index_sizes * 16384 AS Bytes,
                    ROUND(sum_of_other_index_sizes * 16384 / n_rows) AS 'Bytes/row',
                    sum_of_other_index_sizes AS Pages,
                    ROUND(n_rows / sum_of_other_index_sizes) AS 'Rows/page'
            FROM mysql.innodb_table_stats
            $where
              AND sum_of_other_index_sizes > 0
        ";
    

    For Percona:

    /* to enable stats:
        percona < 5.5: set global userstat_running = 1;
                  5.5: set global userstat = 1; */
    $sql = "SELECT  i.INDEX_NAME as Index_Name,
                    IF(ROWS_READ IS NULL, 'Unused',
                        IF(ROWS_READ > 2e9, 'Overflow', ROWS_READ)) as Rows_Read
                FROM (
                    SELECT DISTINCT TABLE_SCHEMA, TABLE_NAME, INDEX_NAME
                        FROM information_schema.STATISTICS
                     ) i
                LEFT JOIN information_schema.INDEX_STATISTICS s
                         ON i.TABLE_SCHEMA = s.TABLE_SCHEMA
                        AND i.TABLE_NAME = s.TABLE_NAME
                        AND i.INDEX_NAME = s.INDEX_NAME
                WHERE i.TABLE_SCHEMA = ?
                  AND i.TABLE_NAME = ?
                ORDER BY IF(i.INDEX_NAME = 'PRIMARY', 0, 1), i.INDEX_NAME";
    

    (Those give more than just the depth.)

    PRIMARY refers to the data's BTree. A name like "n_diff_pfx03" refers to the 3rd level of the BTree; the largest such number for a table indicates the total depth.

    Row width

    As for estimating the width of a row, see Bill's answer. Here's another approach:

    1. Look up the size of each column (INT=4 bytes, use averages for VARs)
    2. Sum those.
    3. Multiply by between 2 and 3 (to allow for overhead of InnoDB)
    4. Divide that into 16K to get the average number of rows per leaf node.
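    The four steps above, as a sketch (the 2.5 overhead factor and the example column sizes are assumptions, not measured values):

```python
def rows_per_leaf(column_bytes, overhead=2.5, page_bytes=16 * 1024):
    """Steps 1-4: sum the column sizes, multiply by an overhead
    factor (between 2 and 3), and divide into the 16KB InnoDB page."""
    row_bytes = sum(column_bytes) * overhead
    return int(page_bytes // row_bytes)

# hypothetical row: INT (4) + INT (4) + avg 40-byte VARCHAR + DATETIME (8)
print(rows_per_leaf([4, 4, 40, 8]))  # about 117 rows per leaf node
```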

    Non-leaf nodes, plus index leaf nodes, are trickier because you need to understand exactly what represents a "row" in such nodes.

    (Hence, my simplistic "100 rows per node".)

    But who cares?

    Here's another simplification that seems to work quite well. Since disk hits are the biggest performance cost in queries, "counting the disk hits" is the first-order way to judge the performance of a query.

    But look at the caching of blocks in the buffer_pool. A parent node is 100 times as likely to be recently touched as the child node.

    So, the simplification is to assume that all non-leaf nodes are cached and all leaf nodes need to be fetched from disk. Hence the depth is not nearly as important as how many leaf-node blocks are touched. This shoots down your "35% speedup": sure, a 35% speedup for CPU, but virtually no speedup for I/O. And I/O is the important component.

    Note that if you are fetching the latest 20 rows of a table that is stored chronologically, they will be found in the last 1 (or maybe 2) blocks. If they are stored by a UUID, it is more likely to take 20 blocks -- many more disk hits, hence much slower.
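    Under this "non-leaf nodes are cached" assumption, counting disk hits reduces to counting distinct leaf blocks. A rough sketch (the 100 rows/block figure is the same assumed average as above):

```python
def leaf_blocks_touched(rows_fetched, rows_per_block=100, clustered=True):
    """Disk hits ~= leaf blocks read, assuming non-leaf nodes are cached.
    Chronologically stored rows sit in adjacent blocks; UUID-keyed rows
    are scattered, roughly one block per row."""
    if clustered:
        return max(1, -(-rows_fetched // rows_per_block))  # ceiling division
    return rows_fetched  # random placement: ~1 block per row

print(leaf_blocks_touched(20, clustered=True))   # latest 20, chronological: 1 block
print(leaf_blocks_touched(20, clustered=False))  # UUID order: ~20 blocks
```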

    Secondary Indexes

    The PRIMARY KEY is clustered with the data. That implies that a lookup by the PK drills down one BTree. But a secondary index is implemented by a second BTree -- drill down it to find the PK, then drill down via the PK. When "counting the disk hits", you need to consider both BTrees. And consider the randomness (eg, for UUIDs) or not (date-ordered).
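    Counting both BTrees can be sketched like this (a hypothetical model reusing the cached-non-leaf assumption; whether the matches' PK values are adjacent decides the second term):

```python
def secondary_index_disk_hits(matching_rows, rows_per_block=100,
                              pk_adjacent=False):
    """Leaf blocks touched for a secondary-index lookup: one pass over
    the index BTree, then the data (PK) BTree for each match."""
    ceil_div = lambda a, b: -(-a // b)
    # matches are adjacent in the secondary index's leaf blocks
    index_hits = max(1, ceil_div(matching_rows, rows_per_block))
    if pk_adjacent:
        # date-ordered PKs: the matches share data blocks
        data_hits = max(1, ceil_div(matching_rows, rows_per_block))
    else:
        # scattered PKs (e.g. UUIDs): ~1 data block per match
        data_hits = matching_rows
    return index_hits + data_hits

print(secondary_index_disk_hits(20))                    # scattered PKs: 21 blocks
print(secondary_index_disk_hits(20, pk_adjacent=True))  # adjacent PKs: 2 blocks
```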

    Writes

    1. Find the block (possibly cached)
    2. Update it
    3. If necessary, deal with a block split
    4. Flag the block as "dirty" in the buffer_pool
    5. Eventually write it back to disk.

    Step 1 may involve a read I/O; step 5 may involve a write I/O -- but you are not waiting for it to finish.

    Index updates

    UNIQUE indexes must be checked before finishing an INSERT. This involves a potentially-cached read I/O.

    For a non-unique index, an entry in the "Change buffer" is made. (This lives in the buffer_pool.) Eventually that is merged with the appropriate block on disk. That is, no waiting for I/O when INSERTing a row (at least not waiting to update non-unique indexes).

    Corollary: UNIQUE indexes are more costly. But is there really any need for more than 2 such indexes (including the PK)?