I have a set of data that has been created by matching together similar sub-items, and then GROUPing these similar items by "category".
Now, the resultant categories must be matched in such a way that groups related categories together within each "group_id". In the example below, one match is A->B->C->D->E->F->G, which is obtained by recursing through rows.
I've posted my current answer, which works on this simple data set, but because the actual data set contains up to 1M rows, and there may be up to 60 categories per "group_id," this query causes an "out of spool space" error on real data.
Please note:
A correct answer will
Sample Input:
Desired Output:
You need a recursive apporach, but your WITH RECURSIVE
creates a humongous intermediate result, which leads to no more spool.
For a similar process I used following approach (originally USING A WHILE-loop in a Stored Procedure):
CREATE MULTISET VOLATILE TABLE vt_tmp, NO Log AS
(
SELECT group_id, category_1, category_2,
-- assign a unique number to
Dense_Rank() Over (ORDER BY group_id, category_1) AS rnk
-- remove when you source data is unique
GROUP BY 1,2,3 -- same result as a DISTINCT, but processed before DENSE_RANK
FROM match_detail
)
WITH DATA
PRIMARY INDEX (category_2)
ON COMMIT PRESERVE ROWS;
Now repeat the following update until 0 rows processed
:
-- find matching categories and assign them a common number
UPDATE vt_tmp FROM
( SELECT e2.group_id, e2.category_1, Min(e1.rnk) AS minrnk
FROM vt_tmp e1 JOIN vt_tmp e2
ON e1.category_2 = e2.category_2
AND e1.rnk < e2.rnk
GROUP BY e2.group_id, e2.category_1
) x
SET rnk = minrnk
WHERE
vt_tmp.group_id = x.group_id
AND vt_tmp.category_1 = x.category_1
;
To get the related categories you finally need:
SELECT group_id, category_1 AS category, rnk AS related_categories
FROM vt_tmp
UNION
SELECT group_id, category_2, rnk
FROM vt_tmp
For an exact match of your expected result you need to add a DENSE_RANK
:
SELECT group_id, category, Dense_Rank() Over (PARTITION BY group_id ORDER BY related_categories)
FROM
(
SELECT group_id, category_1 AS category, rnk AS related_categories
FROM vt_tmp
UNION
SELECT group_id, category_2, rnk
FROM vt_tmp
) AS dt