I'm coding a website in PHP/MySQL and I'd like to implement a similar to stackoverflow tagging engine. I have 3 relevant tables in DB: 1. Items 2. Tags 3. ItemTagMap (maps tags to items, n:n mapping)
Now, on search page I'd like to show distinct list of all tags for entire search result (not just the current page), so that users can "refine" their search by adding/removing tags from that tag list.
The question is that it's a pretty heavy query on the DB and there can be tons of search requests that result in different result sets and thus different tag sets.
Does anyone know how to implement this effectively?
Before we go into premature optimization mode, it may be useful to look into the following query template. If nothing else this could be used as a baseline against which the effectiveness of possible optimizations can be measured.
SELECT T.Tagid, TagInfo.TagName, COUNT(*)
FROM Items I
JOIN Tags TagInfo ON TagInfo.TagId = T.TagId
JOIN ItemTagMap T ON I.ItemId = T.ItemId
--JOIN ItemTagMap T1 ON I.ItemId = T1.ItemId
WHERE I.ItemId IN
(
SELECT ItemId
FROM Items
WHERE -- Some typical initial search criteria
Title LIKE 'Bug Report%' -- Or some fulltext filter instead...
AND ItemDate > '02/22/2008'
AND Status = 'C'
)
--AND T1.TagId = 'MySql'
GROUP BY T.TagId, TagInfo.TagName
ORDER BY COUNT(*) DESC
The subquery is the "driving query", i.e. the one corresponding to the end-user's initial criteria. (see below for details on how this query, required multiple times may fit in an overall optimized flow) Commented is the JOIN on T1 (and possibly T2, T3, when several tags are selected), and, with the WHERE clause, the associated criteria. These are needed when the user selects a particular tag, whether as part of the initial search or by refinement. (It may be more efficient to place these joins and where clauses within the sub-query; more on these below)
Discussion... The "driving query", or a variation thereof is needed for two distinct purposes:
Note that the complete list doesn't need to be sorted (or it may benefit from sorting in a different order), whereby the second list needs to be sorted based on the user's choice (say by Date, descending or by Title, alphabetically ascending). Also note that if there is any sort order required, the cost of the query will imply dealing with the complete list (shy of odd optimization by SQL itself, and/or some denormalization, SQL needs to "see" the last records on that list, in case they belong to the top, sort-wise).
This latter fact, is in favor of having the very same query for both purposes, the corresponding list can be stored in a temporary table. The general flow would be to quickly lookup the top N Item records with their details and returns this to the application at once. The application can then obtain ajax-fashion the list of Tags for refinements. This list would be produce with a query akin the one above, where the subquery is replaced by a "select * from temporaryTable." The odds are good that the SQL optimizer will decide to sort this list (in some cases), let's let it do that, rather than second guessing it and sorting it explicitly.
One other point to consider is to maybe bring the join(s) on ItemTagMap table inside the "driving query" rather that as shown above. It is probably best to do so, both for performance, and because it will produce the right list for the #2 purpose (display of a page of items).
The query/flow described above will likely scale rather well, even on relatively modest hardware; tentatively into the 1/2 Million+ Items, with sustained user searches maybe up to 10 per second. One of the key factor would be the selectivity of the initial search criteria.
Optimization ideas
-- 'nough said! --
Appropriate architecture and optimizations should be selected in light of the actual requirements and of the effective data statistical profile...