I've got performance and conceptional problems with getting a query right on SQL Server 7 running on a dual core 2GHz + 2GB RAM machine - no chance of getting that out of the way, as you might expect :-/.
I'm working with a legacy database and I need to mine for data to get various insights. I've got the all_stats
table that contains all the stat data for a thingy in a specific context. These contexts are grouped with the help of the group_contexts
table. A simplified schema:
+--------------------------------------------------------------------+
| thingies |
+--------------------------------------------------------------------|
| id | INT PRIMARY KEY IDENTITY(1,1) |
+--------------------------------------------------------------------+
+--------------------------------------------------------------------+
| all_stats |
+--------------------------------------------------------------------+
| id | INT PRIMARY KEY IDENTITY(1,1) |
| context_id | INT FOREIGN KEY REFERENCES contexts(id) |
| value | FLOAT NULL |
| some_date | DATETIME NOT NULL |
| thingy_id | INT NOT NULL FOREIGN KEY REFERENCES thingies(id) |
+--------------------------------------------------------------------+
+--------------------------------------------------------------------+
| group_contexts |
+--------------------------------------------------------------------|
| id | INT PRIMARY KEY IDENTITY(1,1) |
| group_id | INT NOT NULL FOREIGN KEY REFERENCES groups(group_id) |
| context_id | INT NOT NULL FOREIGN KEY REFERENCES contexts(id) |
+--------------------------------------------------------------------+
+--------------------------------------------------------------------+
| contexts |
+--------------------------------------------------------------------+
| id | INT PRIMARY KEY IDENTITY(1,1) |
+--------------------------------------------------------------------+
+--------------------------------------------------------------------+
| groups |
+--------------------------------------------------------------------+
| group_id | INT PRIMARY KEY IDENTITY(1,1) |
+--------------------------------------------------------------------+
The task is, for a given set of thingies, to find and aggregate the 3 most recent (all_stats.some_date
) stats of a thingy for all groups the thingy has stats for. I know it sounds easy but I can't get around how to do this properly in SQL - I'm not exactly a prodigy.
My solution right now is to fill a temporary table with all the required data and UNION ALL
ing the data I need:
-- Before I'm building this SQL I retrieve the relevant groups
-- for being able to build the `UNION ALL`s at the bottom.
-- I also retrieve the thingies that are relevant in this context
-- beforehand and include their ids as a comma separated list -
-- I said it would be awfull ...
-- Creating the temp table holding all stats data rows
-- for a thingy in a specific group
CREATE TABLE #stats
(id INT PRIMARY KEY IDENTITY(1,1),
group_id INT NOT NULL,
thingy_id INT NOT NULL,
value FLOAT NOT NULL,
some_date DATETIME NOT NULL)
-- Filling the temp table
INSERT INTO #stats(group_id,thingy_id,value,some_date)
SELECT filtered.group_id, filtered.thingy_id, filtered.some_date, filtered.value
FROM
(SELECT joined.group_id,joined.thingy_id,joined.value,joined.some_date
FROM
(SELECT groups.group_id,data.value,data.thingy_id,data.some_date
FROM
-- Getting the groups associated with the contexts
-- of all the stats available
(SELECT DISTINCT context.group_id
FROM all_stats AS stat
INNER JOIN group_contexts AS groupcontext
ON groupcontext.context_id = stat.context_id
) AS groups
INNER JOIN
-- Joining the available groups with the actual
-- stat data of the group for a thingy
(SELECT context.group_id,stat.value,stat.some_date,stat.thingy_id
FROM all_stats AS stat
INNER JOIN group_contexts AS groupcontext
ON groupcontext.context_id = stat.context_id
WHERE stat.value IS NOT NULL
AND stat.value >= 0) AS data
ON data.group_id = groups.group_id) AS joined
) AS filtered
-- I already have the thingies beforehand but if it would be possible
-- to include/query for them in another way that'd be OK by me
WHERE filtered.thingy_id in (/* somewhere around 10000 thingies are available */)
-- Now I'm building the `UNION ALL`s for each thingy as well as
-- the group the stat of the thingy belongs to
-- thingy 42 {
-- Getting the average of the most recent 3 stat items
-- for a thingy with id 42 in group 982
SELECT x.group_id,x.thingy_id,AVG(x.value)
FROM
(SELECT TOP 3 s.group_id,s.thingy_id,s.value,s.some_date
FROM #stats AS s
WHERE s.group_id = 982
AND s.thingy_id = 42
ORDER BY s.some_date DESC) AS x
GROUP BY x.group_id,x.thingy_id
HAVING COUNT(*) >= 3
UNION ALL
-- Getting the average of the most recent 3 stat items
-- for a thingy with id 42 in group 314159
SELECT x.group_id,x.thingy_id,AVG(x.value)
FROM
(SELECT TOP 3 s.group_id,s.thingy_id,s.value,s.some_date
FROM #stats AS s
WHERE s.group_id = 314159
AND s.thingy_id = 42
ORDER BY s.some_date DESC) AS x
GROUP BY x.group_id,x.thingy_id
HAVING COUNT(*) >= 3
-- }
UNION ALL
-- thingy 21 {
-- Getting the average of the most recent 3 stat items
-- for a thingy with id 21 in group 982
/* you get the idea */
This works - slowly, but it works - for small sets of data (e.g. say 100 thingies that have 10 stats attached each) but the problem domain it has to eventually work is in 10000+ thingies with potentially hundreds of stats per thingy. As a side note: the generated SQL query is ridiculously large: a pretty small query involves say 350 thingies that have data in 3 context groups and it's amounting to more than 250 000 formatted lines of SQL - executing in a stunning 5 minutes.
So if anyone has an idea how to solve this I really, really would appreciate your help :-).
On your ancient SQL Server release you need to use some old-style Scalar Subquery to get the last three rows for all thingies in a single query :-)
SELECT x.group_id,x.thingy_id,AVG(x.value)
FROM
(
SELECT s.group_id,s.thingy_id,s.value
FROM #stats AS s
where (select count(*) from #stats as s2
where s.group_id = s2.group_id
and s.thingy_id = s2.thingy_id
and s.some_date <= s2.some_date
) <= 3
) AS x
GROUP BY x.group_id,x.thingy_id
HAVING COUNT(*) >= 3
To get better performance you need to add a clustered index, probably (group_id,thingy_id,some_date desc,value)
to the #stats
table.
If group_id,thingy_id,some_date
is unique you should remove the useless ID
column, otherwise order by group_id,thingy_id,some_date desc
during the Insert/Select
into #stats
and use ID
instead of some_date
for finding the last three rows.