Tags: mysql, performance, rds

Index on Boolean field to delete records in a partitioned table


I have a large MySQL table which may contain 100 million records. The schema of the table is something like this:

Id varchar(36),   -- GUID, primary key
IsDirty bit(1),
CreatedOn date,
Info varchar(500)

I have partitioned the table on the CreatedOn field, with one partition per month. Some of the rows are updated later and IsDirty is set to 1; at most 10% of the rows would have IsDirty = 1. A process runs every night and deletes rows that are more than 6 months old and have IsDirty = 0.
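
To make this concrete, the setup looks roughly like the sketch below. The partition names and boundaries are illustrative only, and the primary key includes CreatedOn because MySQL requires the partitioning column to be part of every unique key.

-- Illustrative only: table partitioned by month on CreatedOn.
CREATE TABLE your_table (
    Id        varchar(36) NOT NULL,             -- GUID
    IsDirty   bit(1)      NOT NULL DEFAULT b'0',
    CreatedOn date        NOT NULL,
    Info      varchar(500),
    PRIMARY KEY (Id, CreatedOn)
)
PARTITION BY RANGE COLUMNS (CreatedOn) (
    PARTITION p2024_01 VALUES LESS THAN ('2024-02-01'),
    PARTITION p2024_02 VALUES LESS THAN ('2024-03-01'),
    PARTITION p2024_03 VALUES LESS THAN ('2024-04-01'),
    PARTITION p_max    VALUES LESS THAN (MAXVALUE)
);

-- Nightly cleanup: delete rows older than 6 months that were never dirtied.
DELETE FROM your_table
WHERE IsDirty = 0
  AND CreatedOn < DATE_SUB(CURDATE(), INTERVAL 6 MONTH);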

Would there be any performance gain if I also created an index on the IsDirty field? From what I've read, indexing a bit field may not add much to performance, while maintaining the index after the deletes could actually degrade it.

Is my understanding correct? Is there a better way to achieve the desired functionality?


Solution

  • There is a rule of thumb that says it's best to index columns with a high cardinality. Cardinality is the estimated number of distinct values in the column. If you run show indexes from your_table; you will see that an index on your IsDirty column has a cardinality of 2. Very bad.

    However, this rule does not consider the distribution of the data. When only 10% of the rows have IsDirty = 1, queries like select * from your_table where IsDirty = 1 would benefit from the index. Your delete job, on the other hand, checks for IsDirty = 0 and would not benefit: it's cheaper to simply do a full table scan, because using a secondary index means reading the primary key from the index (every secondary index stores the primary key, which is why it's always good to keep the primary key as small as possible) and then looking up each row by that key. See the EXPLAIN sketch at the end of this answer.

    The manual states the following about when a full table scan is preferred:

    Each table index is queried, and the best index is used unless the optimizer believes that it is more efficient to use a table scan. At one time, a scan was used based on whether the best index spanned more than 30% of the table, but a fixed percentage no longer determines the choice between using an index or a scan. The optimizer now is more complex and bases its estimate on additional factors such as table size, number of rows, and I/O block size.

    Also note that the bit datatype is not ideal for storing the values 0 or 1. There is a bool datatype, which is internally realised as tinyint(1). (I think I've read a reason for this somewhere, but I've forgotten it.) See the ALTER TABLE sketch below.
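
    A quick way to see both effects yourself is sketched below. The index name idx_isdirty is just an illustrative choice, and the exact plans depend on your data and MySQL version; the point is to compare what the optimizer picks for the selective and the non-selective predicate.

    -- Illustrative: add the secondary index and inspect its cardinality.
    CREATE INDEX idx_isdirty ON your_table (IsDirty);
    SHOW INDEX FROM your_table;    -- cardinality for idx_isdirty will be ~2

    -- Selective predicate (~10% of rows): the optimizer may use idx_isdirty.
    EXPLAIN SELECT * FROM your_table WHERE IsDirty = 1;

    -- Non-selective predicate (~90% of rows): expect a table/partition scan,
    -- since fetching most rows via a secondary index costs more than scanning.
    -- (EXPLAIN on DELETE requires MySQL 5.6.3 or later.)
    EXPLAIN DELETE FROM your_table
    WHERE IsDirty = 0
      AND CreatedOn < DATE_SUB(CURDATE(), INTERVAL 6 MONTH);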
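
    If you do want to move away from bit(1), a minimal sketch (using the same illustrative table name) would be:

    -- Illustrative: BOOL/BOOLEAN in MySQL is simply a synonym for tinyint(1).
    ALTER TABLE your_table
        MODIFY COLUMN IsDirty tinyint(1) NOT NULL DEFAULT 0;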