Search code examples
postgresqljoinpostgresql-9.3postgresql-performancesql-optimization

Optimizing SQL query with multiple joins and grouping (Postgres 9.3)


I've browsed around some other posts and managed to make my queries run a bit faster. However, I've come to a loss as to how to further optimize this query. I'm going to be using it on a website where it will execute the query when the page is loaded, but 5.5 seconds is far too long to wait for something that should be a lot more simple. The largest table has around 4,000,000 rows and the other ones are around 400,000 each.

Table Structure

match

id BIGINT PRIMARY KEY,
region TEXT,
matchType TEXT,
matchVersion TEXT

team

matchid BIGINT REFERENCES match(id),
id INTEGER,
PRIMARY KEY(matchid, id),
winner TEXT

champion

id INTEGER PRIMARY KEY,
version TEXT,
name TEXT

item

id INTEGER PRIMARY KEY,
name TEXT

participant

PRIMARY KEY(matchid, id),
id INTEGER NOT NULL,
matchid BIGINT REFERENCES match(id),
championid INTEGER REFERENCES champion(id),
teamid INTEGER,
FOREIGN KEY (matchid, teamid) REFERENCES team(matchid, id),
magicDamageDealtToChampions REAL,
damageDealtToChampions REAL,
item0 TEXT,
item1 TEXT,
item2 TEXT,
item3 TEXT,
item4 TEXT,
item5 TEXT,
highestAchievedSeasonTier TEXT

Query

select champion.name,
sum(case when participant.item0 = '3285' then 1::int8 else 0::int8 end) as it0,
sum(case when participant.item1 = '3285' then 1::int8 else 0::int8 end) as it1,
sum(case when participant.item2 = '3285' then 1::int8 else 0::int8 end) as it2,
sum(case when participant.item3 = '3285' then 1::int8 else 0::int8 end) as it3,
sum(case when participant.item4 = '3285' then 1::int8 else 0::int8 end) as it4,
sum(case when participant.item5 = '3285' then 1::int8 else 0::int8 end) as it5
from participant
left join champion
on champion.id = participant.championid
left join team
on team.matchid = participant.matchid and team.id = participant.teamid
left join match
on match.id = participant.matchid
where (team.winner = 'True' and matchversion = '5.14'  and matchtype='RANKED_SOLO_5x5')
group by champion.name;

Output of EXPLAIN ANALYZE: http://explain.depesz.com/s/ZYX

What I've done so far

I've created separate indexes on match.region, participant.championid, and a partial index on team where winner = 'True' (since that is only what I am interested in). Note that enable_seqscan = on since when it's off the query is extremely slow. Essentially, the result I'm trying to get is something like this:

Champion   |item0 | item1 | ... | item5
champ_name | num  |  num1 | ... | num5
...

Since I'm still a beginner with respect to database design, I wouldn't be surprised if there is a flaw in my overall table design. I'm still leaning towards the query being absolutely inefficient, though. I've played with both inner joins and left joins -- there is no significant difference though. Additionally, match needs to be bigint (or something larger than integer, since it's too small).


Solution

  • Database design

    I suggest:

    CREATE TABLE matchversion (
      matchversion_id int PRIMARY KEY
    , matchversion    text UNIQUE NOT NULL
    );
    
    CREATE TABLE matchtype (
      matchtype_id int PRIMARY KEY
    , matchtype    text UNIQUE NOT NULL
    );
    
    CREATE TABLE region (
      region_id int PRIMARY KEY
    , region    text NOT NULL
    );
    
    CREATE TABLE match (
      match_id        bigint PRIMARY KEY
    , region_id       int REFERENCES region
    , matchtype_id    int REFERENCES matchtype
    , matchversion_id int REFERENCES matchversion
    );
    
    CREATE TABLE team (
      match_id bigint REFERENCES match
    , team_id  integer  -- better name !
    , winner   boolean  -- ?!
    , PRIMARY KEY(match_id, team_id)
    );
    
    CREATE TABLE champion (
      champion_id int PRIMARY KEY
    , version     text
    , name        text
    );
    
    CREATE TABLE participant (
      participant_id serial PRIMARY KEY -- use proper name !
    , champion_id    int NOT NULL REFERENCES champion
    , match_id       bigint NOT NULL REFERENCES match -- this FK might be redundant
    , team_id        int
    , magic_damage_dealt_to_champions real
    , damage_dealt_to_champions       real
    , item0      text  -- or integer ??
    , item1      text
    , item2      text
    , item3      text
    , item4      text
    , item5      text
    , highest_achieved_season_tier text  -- integer ??
    , FOREIGN KEY (match_id, team_id) REFERENCES team
    );
    
    • More normalization in order to get smaller tables and indexes and faster access. Create lookup-tables for matchversion, matchtype and region and only write a small integer ID in match.

    • Seems like the columns participant.item0 .. item5 and highestAchievedSeasonTier could be integer, but are defined as text?

    • The column team.winner seems to be boolean, but is defined as text.

    • I also changed the order of columns to be more efficient. Details:

    Query

    Building on above modifications and for Postgres 9.3:

    SELECT c.name, *
    FROM  (
       SELECT p.champion_id
            , count(p.item0 = '3285' OR NULL) AS it0
            , count(p.item1 = '3285' OR NULL) AS it1
            , count(p.item2 = '3285' OR NULL) AS it2
            , count(p.item3 = '3285' OR NULL) AS it3
            , count(p.item4 = '3285' OR NULL) AS it4
            , count(p.item5 = '3285' OR NULL) AS it5
       FROM   matchversion   mv  
       CROSS  JOIN matchtype mt
       JOIN   match          m  USING (matchtype_id, matchversion_id)
       JOIN   team           t  USING (match_id)
       JOIN   participant    p  USING (match_id, team_id)
       WHERE  mv.matchversion = '5.14'
       AND    mt.matchtype = 'RANKED_SOLO_5x5'
       AND    t.winner = 'True' -- should be boolean
       GROUP  BY p.champion_id
       ) p
    JOIN  champion c USING (champion_id);  -- probably just JOIN ?
    
    • Since champion.name is not defined UNIQUE, it's probably wrong to GROUP BY it. It's also inefficient. Use participant.championid instead (and join to champion later if you need the name in the result).

    • All instances of LEFT JOIN are pointless, since you have predicates on the left tables anyway and / or use the column in GROUP BY.

    • Parentheses around AND-ed WHERE conditions are not needed.

    • In Postgres 9.4 or later you could use the new aggregate FILTER syntax instead. Details and alternatives:

    Index

    The partial index on team you already have should look like this to allow index-only scans:

    CREATE INDEX on team (matchid, id) WHERE winner -- boolean
    

    But from what I see, you might just add a winner column to participant and drop the table team completely (unless there is more to it).

    Also, that index is not going to help much, because (telling from your query plan) the table has 800k rows, half of which qualify:

    rows=399999 ... Filter: (winner = 'True'::text) ... Rows Removed by Filter: 399999
    

    This index on match will help a little more (later) when you have more different matchtypes and matchversions:

    CREATE INDEX on match (matchtype_id, matchversion_id, match_id);
    

    Still, while 100k rows qualify out of 400k, the index is only useful for an index only scan. Otherwise, a sequential scan will be faster. An index typically pays for about selecting 5 % of the table or less.

    Your main problem is that you are obviously running a test case with hardly realistic data distribution. With more selective predicates, indexes will be used more readily.

    Aside

    Make sure you have configured basic Postgres settings like random_page_cost or work_mem etc.

    enable_seqscan = on goes without saying. This is only turned off for debugging or locally as a desperate measure of last resort.