mysql sql query-optimization database-performance query-performance

SQL query optimization with 3 joins

I am learning the act of query optimization, in terms of execution time. I have 3 tables which are posts, comments and users. The query below works perfect by returning the Users who have made more comments than posts in the year 2010 and their counts. I believe it can be optimized and I would appreciate an optimized code with an explanation.

Information on each Table

Users table; Has 40325 rows, with indexing on its primary key(Id).
Posts table; Has 91986 rows, with indexing on its primary key(PostId) and also the OwnerUserId which is the foreign key that references the users table.
Comments table - Has 174305 rows, with indexing on its primary key, UserId(for users table) and PostId(for posts table).

SELECT pos_table.user_ID, pos_table.Username, comms, pos from
    (SELECT
    users.Id as 'user_ID', users.DisplayName as 'Username', count(posts.Id) as pos
    FROM
    users
    INNER JOIN posts ON posts.OwnerUserId = users.Id
    WHERE YEAR(posts.CreationDate) = 2010
    group by users.Id
    ) pos_table
    JOIN
    (SELECT
    users.Id as 'user_ID', users.DisplayName as 'Username', count(comments.Id) as
    comms
    FROM
    users
    INNER JOIN comments ON comments.UserId = users.Id
    WHERE YEAR(comments.CreationDate) = 2010
    group by users.Id
    ) comms_table
    on pos_table.user_ID = comms_table.user_ID
    HAVING comms > pos
    order by user_ID
    limit 50;

In my query above, there are two sub-queries. One for posts and the other for comments. I'd like to see how it can be optimized to have a lesser execution time.

The result of the above query, and an attachment of my EXPLAIN query:

Solution

One thing jumps out at me. Both your subqueries have this sort of line in them.

          WHERE YEAR(posts.CreationDate) = 2010

You call a function on a column value. That's not sargeable. It prevents MySQL from being able to exploit an index on that column, and instead requires a full scan. (MySQL, and other DBMSs, are still too dumb to know that YEAR(timestamp) could be satisfied with an index range scan.)

So change those WHEREs to this kind of thing.

          WHERE posts.CreationDate >= '2010-01-01'
            AND posts.CreationDate <  '2010-01-01' + INTERVAL 1 YEAR

and put an index on the CreationDate columns in your posts and comments tables. Then the query planner can random-seek in the index to the first matching row, and then read it sequentially until the last matching row. This is called an index range scan, and is much more efficient than a full table scan.

EDIT You need the following indexes:

CREATE INDEX date_user ON posts ( CreationDate, OwnerUserId );
CREATE INDEX date_user ON comments ( CreationDate, UserID);

I suggest you refactor your query to make your subqueries, where all the work happens, faster.

These should be the subqueries. They each yield a number of items for each user, in the desired time range.

   SELECT OwnerUserId, COUNT(*) posts
     FROM posts
    WHERE CreationDate >= '2010-01-01'
      AND CreationDate <  '2010-01-01' + INTERVAL 1 YEAR
    GROUP BY OwnerUserId

   SELECT UserId, COUNT(*) comments
     FROM comments
    WHERE CreationDate >= '2010-01-01'
      AND CreationDate <  '2010-01-01' + INTERVAL 1 YEAR
    GROUP BY UserId

These queries save time by only aggregating (summarizing by group) the minimal amount of data needed to satisfy the query. And, they can be satisfied by doing a fast index range scan on the indexes I suggested.

Then, you can use these subqueries in your main query, picking up users' names from your users table, like this.

SELECT users.Id user_ID, users.Username, c.comments, p.posts
  FROM users
  JOIN (
       SELECT OwnerUserId, COUNT(*) posts
         FROM posts
        WHERE CreationDate >= '2010-01-01'
          AND CreationDate <  '2010-01-01' + INTERVAL 1 YEAR
        GROUP BY OwnerUserId
        ) p ON users.ID = p.OwnerUserId
   JOIN (
       SELECT UserId, COUNT(*) comments
         FROM comments
        WHERE CreationDate >= '2010-01-01'
          AND CreationDate <  '2010-01-01' + INTERVAL 1 YEAR
        GROUP BY UserId
        ) c ON users.ID = c.UserId
  WHERE c.comments > p.posts
  ORDER BY users.ID
  LIMIT 50;

I suspect you'll get a big performance boost if you add the compound indexes I mentioned. You can drop the indexes on CreationDate; they are redundant when you add the compound indexes.

Here's a worthwhile reference https://use-the-index-luke.com/