I have to retrieve a list of postings, by one or more category IDs. I do not want to have duplicate postings in my results.
I'm only interested in responses that directly relate, or can draw relations with MySQL 8
There are two queries I'm considering, and I'm deciding which one is better. Or, if there is a better "3rd query", please advise.
Consider a simple two table structure:
CREATE TABLE `job_category_posting` (
`category_posting_id` int UNSIGNED NOT NULL,
`category_posting_category_id` int UNSIGNED NOT NULL,
`category_posting_posting_id` int UNSIGNED NOT NULL,
`category_posting_is_primary_category` tinyint UNSIGNED DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
CREATE TABLE `job_posting` (
`posting_id` int UNSIGNED NOT NULL,
`posting_title` varchar(250) NOT NULL,
`posting_body` mediumtext CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
ALTER TABLE `job_category_posting`
ADD PRIMARY KEY (`category_posting_id`),
ADD UNIQUE KEY `category_posting_category_id` (`category_posting_category_id`,`category_posting_posting_id`),
ADD UNIQUE KEY `category_posting_is_primary_category` (`category_posting_is_primary_category`,`category_posting_posting_id`),
ADD KEY `category_posting_posting_id` (`category_posting_posting_id`) USING BTREE;
ALTER TABLE `job_posting`
ADD PRIMARY KEY (`posting_id`),
ADD UNIQUE KEY `posting_reserve_id` (`posting_reserve_id`),
ADD KEY `posting_title` (`posting_title`);
1st query (SUBQUERY with GROUP BY):
SELECT t1.*
FROM job_posting AS t1
WHERE (t1.posting_id) IN(
SELECT category_posting_posting_id
FROM job_category_posting
WHERE category_posting_category_id IN (2,13,22,23,24,25)
GROUP BY category_posting_posting_id
)
Quick dirty speed tests (doesn't tell me much):
What I noticed:
2nd query (INNER JOIN with GROUP BY):
SELECT job_posting.*
FROM job_category_posting
inner join job_posting on job_category_posting.category_posting_posting_id = job_posting.posting_id
WHERE category_posting_category_id IN (2,13,22,23,24,25)
GROUP BY category_posting_posting_id
Quick dirty speed tests (doesn't tell me much):
What I noticed:
So my question is, which is better? Is there a better explanation that can justify it? I just need some solid facts and proof.
Or is there a third query I should try?
Any advice is appreciated!
A few things:
You have appropriate indexes for both your queries.
Execution plans often change as tables grow. The work you do to save a few hundred microseconds on a small table is not necessarily useful for a larger table. You'll probably need to revisit execution plans as your tables grow.
"Using temporary" doesn't mean your query uses a full-blown on-disk temporary table. It simply means the software accumulates a result set into a temporary data structure prior to, in your case, deduplicating it. (Only if that temporary data structure is too big for RAM does the software use on-disk structures. Yours certainly fits in RAM.) Don't be fooled by the inaccurate, but time-honored, language in the execution plan's extra
column. "Using temporary" is OK.
The value IN (set of values)
predicate automatically deduplicates the set of values. So your first query can be rewritten without the GROUP BY thusly.
SELECT t1.*
FROM job_posting AS t1
WHERE t1.posting_id IN (
SELECT category_posting_posting_id
FROM job_category_posting
WHERE category_posting_category_id IN (2,13,22,23,24,25)
)
This is the query I would use, because (in my opinion) it expresses your intent most clearly. And, I suspect it scales up better to large tables because it does the deduplication work on the set of posting_id
values alone, not whole rows.
Your second query misuses MySQL's notorious nonstandard extension to GROUP BY. Disable that extension with SET sql_mode = CONCAT_WS(',',@@sql_mode, 'ONLY_FULL_GROUP_BY')
, then try your query again. You'll need more terms in your GROUP BY clause. Better yet, get rid of the GROUP BY and use DISTINCT, like this.
SELECT DISTINCT job_posting.*
FROM job_category_posting
inner join job_posting
on job_category_posting.category_posting_posting_id = job_posting.posting_id
WHERE category_posting_category_id IN (2,13,22,23,24,25)
But this has to deduplicate whole rows.