I am learning the act of query optimization, in terms of execution time. I have 3 tables which are posts, comments and users. The query below works perfect by returning the Users who have made more comments than posts in the year 2010 and their counts. I believe it can be optimized and I would appreciate an optimized code with an explanation.
Information on each Table
SELECT pos_table.user_ID, pos_table.Username, comms, pos from
(SELECT
users.Id as 'user_ID', users.DisplayName as 'Username', count(posts.Id) as pos
FROM
users
INNER JOIN posts ON posts.OwnerUserId = users.Id
WHERE YEAR(posts.CreationDate) = 2010
group by users.Id
) pos_table
JOIN
(SELECT
users.Id as 'user_ID', users.DisplayName as 'Username', count(comments.Id) as
comms
FROM
users
INNER JOIN comments ON comments.UserId = users.Id
WHERE YEAR(comments.CreationDate) = 2010
group by users.Id
) comms_table
on pos_table.user_ID = comms_table.user_ID
HAVING comms > pos
order by user_ID
limit 50;
In my query above, there are two sub-queries. One for posts and the other for comments. I'd like to see how it can be optimized to have a lesser execution time.
The result of the above query, and an attachment of my EXPLAIN query:
One thing jumps out at me. Both your subqueries have this sort of line in them.
WHERE YEAR(posts.CreationDate) = 2010
You call a function on a column value. That's not sargeable. It prevents MySQL from being able to exploit an index on that column, and instead requires a full scan. (MySQL, and other DBMSs, are still too dumb to know that YEAR(timestamp)
could be satisfied with an index range scan.)
So change those WHEREs to this kind of thing.
WHERE posts.CreationDate >= '2010-01-01'
AND posts.CreationDate < '2010-01-01' + INTERVAL 1 YEAR
and put an index on the CreationDate columns in your posts and comments tables. Then the query planner can random-seek in the index to the first matching row, and then read it sequentially until the last matching row. This is called an index range scan, and is much more efficient than a full table scan.
EDIT You need the following indexes:
CREATE INDEX date_user ON posts ( CreationDate, OwnerUserId );
CREATE INDEX date_user ON comments ( CreationDate, UserID);
I suggest you refactor your query to make your subqueries, where all the work happens, faster.
These should be the subqueries. They each yield a number of items for each user, in the desired time range.
SELECT OwnerUserId, COUNT(*) posts
FROM posts
WHERE CreationDate >= '2010-01-01'
AND CreationDate < '2010-01-01' + INTERVAL 1 YEAR
GROUP BY OwnerUserId
SELECT UserId, COUNT(*) comments
FROM comments
WHERE CreationDate >= '2010-01-01'
AND CreationDate < '2010-01-01' + INTERVAL 1 YEAR
GROUP BY UserId
These queries save time by only aggregating (summarizing by group) the minimal amount of data needed to satisfy the query. And, they can be satisfied by doing a fast index range scan on the indexes I suggested.
Then, you can use these subqueries in your main query, picking up users' names from your users
table, like this.
SELECT users.Id user_ID, users.Username, c.comments, p.posts
FROM users
JOIN (
SELECT OwnerUserId, COUNT(*) posts
FROM posts
WHERE CreationDate >= '2010-01-01'
AND CreationDate < '2010-01-01' + INTERVAL 1 YEAR
GROUP BY OwnerUserId
) p ON users.ID = p.OwnerUserId
JOIN (
SELECT UserId, COUNT(*) comments
FROM comments
WHERE CreationDate >= '2010-01-01'
AND CreationDate < '2010-01-01' + INTERVAL 1 YEAR
GROUP BY UserId
) c ON users.ID = c.UserId
WHERE c.comments > p.posts
ORDER BY users.ID
LIMIT 50;
I suspect you'll get a big performance boost if you add the compound indexes I mentioned. You can drop the indexes on CreationDate; they are redundant when you add the compound indexes.
Here's a worthwhile reference https://use-the-index-luke.com/