MySQL primary key not being used in query

I have a MySQL query that is apparently not using one of the primary keys, and this is slowing it down.

The tables look like this:

staff_main:
    int staff_ID (the primary key)
    string name

production_role:
    int row_index (primary key, auto-incremented)
    int staff_ID (indexed)
    int production_ID (indexed)
    int role_ID

production_role_episodes:
    int row_index (primary key, autoincremented)
    int match_index (foreign key to production_role.row_index)
    int episode_index (foreign key to episode_info.episode_index)

episode_info:
    int episode_index (primary key)
    int production_ID
    ...other info not used here

And the query looks like this. It's intended to take the index ID of an episode and the ID of a role and look for all staff members who held that role in the specified episode.

SELECT staff_main.staff_ID AS sid,
    staff_main.name AS name   
FROM production_role_episodes      
    JOIN production_role ON (production_role.row_index = production_role_eps.match_index)      
JOIN staff_main USING (staff_ID)     
WHERE production_role_eps.episode_index = {episode}    
    AND production_role.role_ID = {role}     
ORDER BY name

The staff table has ~9000 rows, and this was starting to get slow. An EXPLAIN produced the following:

+----+-------------+---------------------+--------+------------------+----------+---------+----------------------------------------------+------+---------------------------------+
| id | select_type | table               | type   | possible_keys    | key      | key_len | ref                                          | rows | Extra                           |
+----+-------------+---------------------+--------+------------------+----------+---------+----------------------------------------------+------+---------------------------------+
|  1 | SIMPLE      | staff_main          | ALL    | PRIMARY          | NULL     | NULL    | NULL                                         | 9327 | Using temporary; Using filesort |
|  1 | SIMPLE      | production_role     | ref    | PRIMARY,staff_ID | staff_ID | 4       | test_prod_db.staff_main.staff_ID             |    2 | Using where                     |
|  1 | SIMPLE      | production_role_eps | eq_ref | PRIMARY          | PRIMARY  | 8       | test_prod_db.production_role.row_index,const |    1 | Using index                     |
+----+-------------+-------====----------+--------+------------------+----------+---------+----------------------------------------------+------+---------------------------------+

It is apparently not using staff_main.staff_ID as a key, despite this being a primary key. I tried to force it by adding a USE INDEX(PRIMARY) to the staff_main JOIN, but according to EXPLAIN, it's still not using the key. I tried rearranging the JOINs, I tried replacing the USING (staff_ID) with ON (production_role.staff_ID = staff_main.staff_ID), no dice.

Can anyone tell me what's going on? staff_main isn't going to get any smaller, so this query's just going to lag more and more if I can't reel in that index.

Solution

The optimizer is telling MySQL that it will be more advantageous to run a full table scan on the staff table, and retrieve the remaining information, rather than running a scan on episode index and role ID and join staff later.

You can drop a hint that a table scan is very costly, to exclude table scans. But chances are that the optimizer is right and running the query in the other direction will cost you more.

It seems to me that you need these two indexes (role_ID is not indexed in your description), with this exact structure:

CREATE INDEX production_role_ndx ON production_role(role_ID, row_index, staff_ID);

CREATE INDEX production_role_eps_ndx ON production_role_episodes(episode_index, match_index);

You don't seem to need as much, for this query (but maybe for others?), these others:

int staff_ID (indexed)
int production_ID (indexed)

Lengthier explanation

Your query (shortened) is:

SELECT staff_ID, name
FROM pre
JOIN pr ON (pr.row_index = pre.match_index)
JOIN sm ON (sm.staff_ID = pr.staff_ID)
WHERE pre.episode_index = {episode}    
AND pr.role_ID = {role}
ORDER BY name

So, what does it need? And from where it is more convenient to start?

The data comes from two places: the indexes (getting them is fast) and the tables (getting them is slow).

We want to minimize the number of tuples retrieved, but that number is an estimate based on the geometry of the JOIN. Then, we want to retrieve more information possible from the indexes, and not retrieve redundant information.

The query above requires:

sm.staff_ID, sm name                                      for the SELECT
pr.row_index, pre.match_index, sm.staff_ID, pr.staff_ID   for the JOIN
pre.episode_index, pr.role_ID                             for the WHERE

To run optimally the query, we need to reduce the data as soon as possible, so we need to know whether the episode index or the role ID cardinality is greater. Chances are that roles are few and episodes are many, which means that restricting to one episode out of 1000 will reduce our data by 1/1000, while filtering on the role will reduce by maybe 1/20.

So we run the query with the WHERE on pre.episode_index alone. And we need an index on pre that has, as first field, episode_index. Pre is our main table.

Then we join pr. We had a filter on pr.role_ID too. How we find the rows of pr?

pr.row_index = pre.match_index
pr.role_ID = {role}

JOIN pr ON (pr.row_index = pre.match_index AND pr_role_ID = {role})

So we want to index pr on row_index first, because it's driven from the first table, and role_ID second, to immediately further restrict the work. We have not yet accessed either of the two tables: we have only checked the index.

If we add a third column with the staff ID to the pr index, the data we're going to need next, i.e. staff_ID, will be all contained in the index, which becomes what's known as a covering index - and we won't need the table pr at all either. You should see in the EXPLAIN something like "using JOIN buffer", which means that the join happens piecemeal in optimized "bursts".

Of course the estimate that EXPLAIN does will still be based on the number of rows of the first WHERE, so it'll be about the average number of episode rows multiplied by the average number of roles. That's the worst case estimate: you well know that some combinations of episode and role might actually return nothing. So, you shouldn't let a huge estimate worry you.

At this point we have staff_main and the query supplies staff_ID as its primary key, so we need do nothing: just join staff_main. For good measure in the select specify that staff_ID comes from pr, and not staff_main. The value is the same and it probably changes nothing, but a ready access to pr.staff_ID is guaranteed and easy (we have it in a covering index), and we don't want to confuse the optimizer, just in case.