Search code examples
sqlpartitioningsql-server-7

Speeding up partitioning query on ancient SQL Server version


The Setup

I've got performance and conceptional problems with getting a query right on SQL Server 7 running on a dual core 2GHz + 2GB RAM machine - no chance of getting that out of the way, as you might expect :-/.

The Situation

I'm working with a legacy database and I need to mine for data to get various insights. I've got the all_stats table that contains all the stat data for a thingy in a specific context. These contexts are grouped with the help of the group_contexts table. A simplified schema:

+--------------------------------------------------------------------+
| thingies                                                           |
+--------------------------------------------------------------------|
| id          | INT PRIMARY KEY IDENTITY(1,1)                        |
+--------------------------------------------------------------------+

+--------------------------------------------------------------------+
| all_stats                                                          |
+--------------------------------------------------------------------+
| id          | INT PRIMARY KEY IDENTITY(1,1)                        |
| context_id  | INT FOREIGN KEY REFERENCES contexts(id)              |
| value       | FLOAT NULL                                           |
| some_date   | DATETIME NOT NULL                                    |
| thingy_id   | INT NOT NULL FOREIGN KEY REFERENCES thingies(id)     |
+--------------------------------------------------------------------+

+--------------------------------------------------------------------+
| group_contexts                                                     |
+--------------------------------------------------------------------|
| id          | INT PRIMARY KEY IDENTITY(1,1)                        |
| group_id    | INT NOT NULL FOREIGN KEY REFERENCES groups(group_id) |
| context_id  | INT NOT NULL FOREIGN KEY REFERENCES contexts(id)     |
+--------------------------------------------------------------------+

+--------------------------------------------------------------------+
| contexts                                                           |
+--------------------------------------------------------------------+
| id          | INT PRIMARY KEY IDENTITY(1,1)                        |
+--------------------------------------------------------------------+

+--------------------------------------------------------------------+
| groups                                                             |
+--------------------------------------------------------------------+
| group_id    | INT PRIMARY KEY IDENTITY(1,1)                        |
+--------------------------------------------------------------------+

The Problem

The task is, for a given set of thingies, to find and aggregate the 3 most recent (all_stats.some_date) stats of a thingy for all groups the thingy has stats for. I know it sounds easy but I can't get around how to do this properly in SQL - I'm not exactly a prodigy.

My Bad Solution (no it's really bad...)

My solution right now is to fill a temporary table with all the required data and UNION ALLing the data I need:

-- Before I'm building this SQL I retrieve the relevant groups
-- for being able to build the `UNION ALL`s at the bottom.
-- I also retrieve the thingies that are relevant in this context
-- beforehand and include their ids as a comma separated list -
-- I said it would be awfull ...

-- Creating the temp table holding all stats data rows
-- for a thingy in a specific group
CREATE TABLE #stats
(id INT PRIMARY KEY IDENTITY(1,1),
 group_id INT NOT NULL,
 thingy_id INT NOT NULL,
 value FLOAT NOT NULL,
 some_date DATETIME NOT NULL)

-- Filling the temp table
INSERT INTO #stats(group_id,thingy_id,value,some_date)
SELECT filtered.group_id, filtered.thingy_id, filtered.some_date, filtered.value
FROM
   (SELECT joined.group_id,joined.thingy_id,joined.value,joined.some_date
    FROM
       (SELECT groups.group_id,data.value,data.thingy_id,data.some_date
        FROM
            -- Getting the groups associated with the contexts
            -- of all the stats available
           (SELECT DISTINCT context.group_id
            FROM all_stats AS stat
            INNER JOIN group_contexts AS groupcontext
                ON groupcontext.context_id = stat.context_id
        ) AS groups
        INNER JOIN
            -- Joining the available groups with the actual
            -- stat data of the group for a thingy
           (SELECT context.group_id,stat.value,stat.some_date,stat.thingy_id
            FROM all_stats AS stat
            INNER JOIN group_contexts AS groupcontext
                ON groupcontext.context_id = stat.context_id
            WHERE stat.value IS NOT NULL
              AND stat.value >= 0) AS data
        ON data.group_id = groups.group_id) AS joined
    ) AS filtered
-- I already have the thingies beforehand but if it would be possible
-- to include/query for them in another way that'd be OK by me
WHERE filtered.thingy_id in (/* somewhere around 10000 thingies are available */)

-- Now I'm building the `UNION ALL`s for each thingy as well as
-- the group the stat of the thingy belongs to

-- thingy 42 {
-- Getting the average of the most recent 3 stat items
-- for a thingy with id 42 in group 982
SELECT x.group_id,x.thingy_id,AVG(x.value)
FROM
   (SELECT TOP 3 s.group_id,s.thingy_id,s.value,s.some_date
    FROM #stats AS s
    WHERE s.group_id = 982
      AND s.thingy_id = 42
    ORDER BY s.some_date DESC) AS x
GROUP BY x.group_id,x.thingy_id
HAVING COUNT(*) >= 3

UNION ALL

-- Getting the average of the most recent 3 stat items
-- for a thingy with id 42 in group 314159
SELECT x.group_id,x.thingy_id,AVG(x.value)
FROM
   (SELECT TOP 3 s.group_id,s.thingy_id,s.value,s.some_date
    FROM #stats AS s
    WHERE s.group_id = 314159
      AND s.thingy_id = 42
    ORDER BY s.some_date DESC) AS x
GROUP BY x.group_id,x.thingy_id
HAVING COUNT(*) >= 3
-- }

UNION ALL

-- thingy 21 {
-- Getting the average of the most recent 3 stat items
-- for a thingy with id 21 in group 982
/* you get the idea */

This works - slowly, but it works - for small sets of data (e.g. say 100 thingies that have 10 stats attached each) but the problem domain it has to eventually work is in 10000+ thingies with potentially hundreds of stats per thingy. As a side note: the generated SQL query is ridiculously large: a pretty small query involves say 350 thingies that have data in 3 context groups and it's amounting to more than 250 000 formatted lines of SQL - executing in a stunning 5 minutes.

So if anyone has an idea how to solve this I really, really would appreciate your help :-).


Solution

  • On your ancient SQL Server release you need to use some old-style Scalar Subquery to get the last three rows for all thingies in a single query :-)

    SELECT x.group_id,x.thingy_id,AVG(x.value)
    FROM
     (
       SELECT s.group_id,s.thingy_id,s.value
       FROM #stats AS s
       where (select count(*) from #stats as s2
              where s.group_id = s2.group_id
                and s.thingy_id = s2.thingy_id 
                and s.some_date <= s2.some_date
             ) <= 3
     ) AS x
    GROUP BY x.group_id,x.thingy_id
    HAVING COUNT(*) >= 3
    

    To get better performance you need to add a clustered index, probably (group_id,thingy_id,some_date desc,value) to the #stats table.

    If group_id,thingy_id,some_date is unique you should remove the useless ID column, otherwise order by group_id,thingy_id,some_date desc during the Insert/Select into #stats and use ID instead of some_date for finding the last three rows.