Search code examples
mysqljoininner-jointemp-tables

Joining 2 MySQL temporary tables is 50x slower than joining a temporary table with a normal table and adding a WHERE clause?


I have 2 groups that I'm trying to find the intersection of (requiring 2 columns to match up), and I'm finding that joining 2 temporary tables yields a 50x slower performance than just joining against the raw table with one temporary table. This makes no sense to me, so perhaps someone can enlighten me?

Here's how I wrote the 2 temporary tables version:

CREATE TEMPORARY TABLE attendees (
    event_id SMALLINT(5) UNSIGNED,
    person_id INT(10) UNSIGNED NOT NULL,
    KEY(event_id),
    KEY(person_id)
);
INSERT INTO attendees (event_id, person_id) 
    SELECT event_id, person_id
    FROM attendance WHERE year=2013
    GROUP BY event_id, person_id;
CREATE TEMPORARY TABLE invitees (
    event_id SMALLINT(5) UNSIGNED,
    person_id INT(10) UNSIGNED NOT NULL,
    KEY(event_id),
    KEY(person_id)
);
INSERT INTO invitees (event_id, person_id)
    SELECT event_id, person_id
    FROM invitations WHERE year=2013
    GROUP BY event_id, person_id;
SELECT i.event_id, COUNT(DISTINCT i.person_id)
    FROM attendees AS a
    INNER JOIN invitees AS i
        ON a.person_id = i.person_id AND a.event_id = i.event_id
    GROUP BY i.event_id;

There are less than 2,000 rows in each of these 2 temporary tables, yet this final query takes ~2.5 seconds on my laptop. I don't understand how that can be possible.

On the other hand, with the following implementation, the final query only takes 0.05 seconds, even though it's hitting the full invitations table (~100,000 rows):

CREATE TEMPORARY TABLE attendees (
    event_id SMALLINT(5) UNSIGNED,
    person_id INT(10) UNSIGNED NOT NULL,
    KEY(event_id),
    KEY(person_id)
);
INSERT INTO attendees (event_id, person_id) 
    SELECT event_id, person_id
    FROM attendance WHERE year=2013
    GROUP BY event_id, person_id;
SELECT i.event_id, COUNT(DISTINCT i.person_id)
    FROM attendees AS a
    INNER JOIN invitations AS i
        ON a.person_id = i.person_id AND a.event_id = i.event_id
    WHERE i.year=2013
    GROUP BY i.event_id;

For what it's worth, both the original tables (attendance and invitations) have indexes on event_id, person_id, and year. And the reason I'm doing such complicated code in the first place is that there are attendees to events that were not invited, and I have to calculate the count of every part of the venn diagram of these sets of people (attended and invited, attended not invited, invited not attended, and neither).

I guess my question is, what's going on here to make the second version so much faster?

In case it matters, my server version is 5.5.36 MySQL Community Server (5.6 had several strange behaviors that broke my site).


Solution

  • A join can only make use of one index per table. Instead of separate indexes on person_id and event_id, give the tables composite indexes on both:

    CREATE TEMPORARY TABLE attendees (
        event_id SMALLINT(5) UNSIGNED,
        person_id INT(10) UNSIGNED NOT NULL,
        KEY(event_id, person_id)
    );
    
    CREATE TEMPORARY TABLE invitees (
        event_id SMALLINT(5) UNSIGNED,
        person_id INT(10) UNSIGNED NOT NULL,
        KEY(event_id, person_id)
    );
    

    I suspect the original attendance table has an index like this, which makes the join with that table much faster.