I've just added a tagging system to my website and I'm trying to figure out the most efficient way to run scalable queries. Here's a basic working mysql query to return tag matches for a given user:
SELECT
scans.scan_index,
scans.scan_id,
scans.archive_folder
FROM
tags
INNER JOIN
interpretationtags USING (tagid)
INNER JOIN
interpretations USING (interpretation_id)
INNER JOIN
scans
ON scans.scan_id = interpretations.scan_id
AND scans.archive_folder = interpretations.archive_folder
INNER JOIN
archives
ON scans.archive_folder = archives.archive_folder
WHERE
archives.user_id = "google-authd...."
AND tags.tag = "tag1"
But it gets sticky when I want to query multiple tags
for the same scan
. You see, tags
are present in different interpretations,
and there are multiple interpretations for each scan.
Here's a working query for two tags
using a subquery:
SELECT
a.scan_index,
a.scan_id,
a.archive_folder
FROM
(
SELECT
scans.scan_index,
scans.scan_id,
scans.archive_folder
FROM
tags
INNER JOIN
interpretationtags USING (tagid)
INNER JOIN
interpretations USING (interpretation_id)
INNER JOIN
scans
ON scans.scan_id = interpretations.scan_id
AND scans.archive_folder = interpretations.archive_folder
INNER JOIN
archives
ON scans.archive_folder = archives.archive_folder
WHERE
archives.user_id = "google-auth2..."
AND tags.tag = "tag1"
)
as a
INNER JOIN
interpretations
ON a.scan_id = interpretations.scan_id
AND a.archive_folder = interpretations.archive_folder
INNER JOIN
interpretationtags USING(interpretation_id)
INNER JOIN
tags USING(tagid)
WHERE
tags.tag = "tag2"
Since this is running on a LAMP stack, I've written some PHP code to iterate over the tags
I'd like to include in this AND-style search, building a multi-nested query. Here's one with three
SELECT
b.scan_index,
b.scan_id,
b.archive_folder
FROM
(
SELECT
a.scan_index,
a.scan_id,
a.archive_folder
FROM
(
SELECT
scans.scan_index,
scans.scan_id,
scans.archive_folder
FROM
tags
INNER JOIN
interpretationtags USING (tagid)
INNER JOIN
interpretations USING (interpretation_id)
INNER JOIN
scans
ON scans.scan_id = interpretations.scan_id
AND scans.archive_folder = interpretations.archive_folder
INNER JOIN
archives
ON scans.archive_folder = archives.archive_folder
WHERE
archives.user_id = "google..."
AND tags.tag = "tag1"
)
as a
INNER JOIN
interpretations
ON a.scan_id = interpretations.scan_id
AND a.archive_folder = interpretations.archive_folder
INNER JOIN
interpretationtags USING(interpretation_id)
INNER JOIN
tags USING(tagid)
WHERE
tags.tag = "tag2"
)
as b
INNER JOIN
interpretations
ON b.scan_id = interpretations.scan_id
AND b.archive_folder = interpretations.archive_folder
INNER JOIN
interpretationtags USING(interpretation_id)
INNER JOIN
tags USING(tagid)
WHERE
tags.tag = "tag3"
Even 4 nested subqueries runs fast with minimal data, but I just don't see this being a scalable solution when I'm dealing with 100k rows of data. How can I accomplish this without reverting to this ugly inefficient code?
It's hard to be certain without table structures and sample data, but I think you're going about this in the wrong direction. You should start from scans and find all the appropriate tags, and then filter on those (which should then be a simple IN
expression):
SELECT
scans.scan_index,
scans.scan_id,
scans.archive_folder
FROM
scans
INNER JOIN
archives
ON scans.archive_folder = archives.archive_folder
INNER JOIN
interpretations
ON scans.scan_id = interpretations.scan_id
AND scans.archive_folder = interpretations.archive_folder
INNER JOIN
interpretationtags USING (interpretation_id)
INNER JOIN
tags USING (tagid)
WHERE
archives.user_id = "google-authd...."
AND tags.tag IN("tag1", "tag2")
Note that based on your SELECT
field list I don't think you actually need to JOIN
to archives
at all.