Search code examples
mysqloptimizationsubqueryquery-optimization

Query optimization in MYSQL for a query with sub queries?


In an assignment, I was given a query and asked to optimize it. The query is:

SELECT
    C.yearID as year,
    name as teamName,
    C.lgID as league,
    D.cnt as totalBatters,
    C.cnt as aboveAverageBatters
FROM
    (SELECT 
        count(masterID) as cnt, A.yearID, A.teamID, A.lgID
    FROM
        (select 
        masterID,
            teamID,
            yearID,
            lgID,
            sum(AB),
            sum(H),
            sum(H) / sum(AB) as avg
    FROM
        batting
    GROUP BY teamID , yearID , lgID , masterID) B, (select 
        teamID,
            yearID,
            lgID,
            sum(AB),
            sum(H),
            sum(H) / sum(AB) as avg
    FROM
        batting
    WHERE ab is not null
    GROUP BY teamID , yearID , lgID) A
    WHERE
        A.avg >= B.avg AND A.teamID = B.teamID
            AND A.yearID = B.yearID
            AND A.lgID = B.lgID
    GROUP BY teamID , yearID , lgID) C,
    (SELECT 
        count(masterID) as cnt, yearID, teamID, lgID
    FROM
        batting
    WHERE ab is not null
    GROUP BY yearID , teamID , lgID) D, 
    teams
WHERE
    C.cnt / D.cnt >= 0.75
        AND C.yearID = D.yearID
        AND C.teamID = D.teamID
        AND C.lgID = D.lgID
        AND teams.yearID = C.yearID
        AND teams.lgID = C.lgID
        AND teams.teamID = C.teamID

I was wondering what can be done to optimize this? I am new to this concept and a little confused on how to proceed. In General, how to optimize sub queries which have select statements in it?


Solution

  • In General, how to optimize sub queries which have select statements in it?

    Here are some ideas to get you started. I will respect the fact that it is an assignment and in the end, you will gain much better understanding of the SQL queries by going through it yourself and learning along the way.

    I would hope that the assignment included a sort of data set that you could import into MySQL so that you can run the query as you make changes and notice the impact on the execution plan and overall performance.


    Aliases

    Before even thinking of optimizing, perhaps you can look at how to make the code easier to read, understand and maintain. The subqueries behave in a manner similar to regular tables, and as such, they should be given aliases/names that make sense as to what the data set means.

    They are aliased B, A, C and D, which look almost intentionally named to obfuscate, but actually you would be surprised how often you see poor naming/aliasing in real-life, production SQL code.

    Try looking (and if you can, running) each subquery on its own, look at the fields and what they mean, then replace the alias with a good name, and update the references in the different columns accordingly. This will optimize the query for better clarity, and ultimately, maintainability.


    JOINs

    Hopefully by the time of this assignment, the various types of JOIN operations have already been covered. If not, here is a good summary from a StackOverflow answer. There is a large number of other resources covering the ins and outs of JOIN, including a good post on TechOnTheNet.

    Let's strip down the subqueries and look at how the query as a whole is structured. I replaced the logic by comments to make it more apparent:

    SELECT
    --columns
    FROM 
    (
        SELECT  
        --columns
        FROM 
        (
            select 
            --columns
            FROM batting
        ) B, 
        (
            select 
            --columns
        ) A
        WHERE
        --some comparisons of averages
    ) C,
    (
        SELECT 
         --columns
        FROM batting
    ) D, 
    teams
    WHERE
        --a filter based on a calculation
        C.cnt / D.cnt >= 0.75
        --um... what is all this stuff doing down here?
        --shouldn't those be in a JOIN?
            AND C.yearID = D.yearID
            AND C.teamID = D.teamID
            AND C.lgID = D.lgID
            AND teams.yearID = C.yearID
            AND teams.lgID = C.lgID
            AND teams.teamID = C.teamID
    

    Do you notice anything peculiar or anything that looks strange? If you have not read this before, I strongly recommend reading Bad habits to kick : using old-style JOINs by Aaron Bertrand.

    After reading it, look at the skeleton of this query again, and the improvements you can make by using modern JOIN should stand out. This will make the query that much more optimal in clarity and maintainability.


    Consistency in keywords case

    Another way you can make it more readable is by using consistent capitalization of keywords. As it is, it's about 50/50 using CAPITAL CASE and small case. It may seem insignificant for just one script, but when this inconsistency is spread over a whole code base, it can get really irritating for the next person who will have to develop in it and maintain it.


    Performance

    So, by now if you've applied everything so far, the code should be a lot easier to decipher. As far as performance is concerned, there are two things that stand out to me as detrimental. There are lots of aggregates, and consequently, lots of GROUP BYs.

    Look at each subquery by itself first, and at each aggregation. Look at how each of the fields is used in the context of the whole query. See which ones you might be able to remove, that perhaps the person who wrote the query originally thought they were going to need, but ended up not using and forgot to remove them.

    Try the same tactic for the GROUP BY fields, which are every field you are not aggregating that is included in a query that has one or more aggregate operations. GROUP BY can get pretty expensive, and that is compounded by the fact the derived subqueries also have GROUP BY.


    There are a few other tricks you can try, that are more advanced and may improve execution at the compromise of I/O, for example extracting one or more subqueries' result sets into temporary tables, which would free up locks on the primary tables.

    Optimizations like these might not always necessarily improve the execution speed per se, but in a production environment with database servers under load, often speed is not the primary concern for optimization, but rather "lightness" (or as small of a load footprint on the servers as possible) is often much more valuable than raw speed which ends up using more resources.


    I hope this helps!