Search code examples
mysqlindexingamazon-rdsparquetcardinality

Why would cardinality of an index in a restored table be different from cardinality in the original table?


I'm testing a proprietary tool that dumps a table in a MySQL RDS to the parquet format, and then restores it into another MySQL RDS.

Both tables have the same amount of rows:

mysql> SELECT COUNT(*) FROM fox_owners;
+----------+
| COUNT(*) |
+----------+
|   118950 |
+----------+

The table itself is configured the same way in both cases:

mysql> SHOW CREATE TABLE fox_owners;
+------------+-------------------------------------------------------+
| Table      | Create Table                                          |
+------------+-------------------------------------------------------+
| fox_owners | CREATE TABLE `fox_owners` (
  `name` mediumtext,
  `owner_id` bigint NOT NULL,
  PRIMARY KEY (`owner_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
+------------+-------------------------------------------------------+

So far so good, right?

However, the table sizes are different. The original:

+----------+----------------------+------------+
| Database | Table                | Size in MB |
+----------+----------------------+------------+
| stam_db  | fox_owners           |    5582.52 |

The restored one:

+----------+----------------------+------------+
| Database | Table                | Size in MB |
+----------+----------------------+------------+
| stam_db  | fox_owners           |    5584.52 |

The restored one is 2MB bigger! However, what's really bugging me, is the change in cardinality of the indexes between the 2 tables. Original:

mysql> SHOW INDEX FROM fox_owners;
+------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table      | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| fox_owners |          0 | PRIMARY  |            1 | owner_id    | A         |      118728 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
+------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+

Restored:

mysql> SHOW INDEX FROM fox_owners;
+------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table      | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| fox_owners |          0 | PRIMARY  |            1 | owner_id    | A         |      117518 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
+------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+

Why would cardinality drop from 118728 to 117518? If the restored table is less unique than the original - isn't this a clear sign that this table is different? How can I verify that these 2 tables in separate RDS databases have identical content?

And shouldn't the cardinality be 118950 in both of them anyway, since for a table with a single primary key column, the cardinality must be equal to the number of rows in the table?

I've ran ANALYZE TABLE on both tables, the values didn't change.


Solution

  • No Problem.

    The "cardinality" is determined by making a small number of 'random' probes into the table. This leads to estimates. Sometimes the estimates are off by a factor of two or even more. 118728 and 117518 are unusually close to each other.

    When loading/copying/altering a table, the BTrees are rebuilt. This leads to a likely variation of how the blocks of the BTree are laid out. So, it is normal to see the size (on disk) of a table change. A change of a factor of 2 is rare for this.