Search code examples
sqlteradataansi-sql

Connected Components


I have a set of data that has been created by matching together similar sub-items, and then GROUPing these similar items by "category".

Now, the resultant categories must be matched in such a way that groups related categories together within each "group_id". In the example below, one match is A->B->C->D->E->F->G, which is obtained by recursing through rows.

I've posted my current answer, which works on this simple data set, but because the actual data set contains up to 1M rows, and there may be up to 60 categories per "group_id," this query causes an "out of spool space" error on real data.

Please note:

  • Due to company restrictions, I cannot use stored procedures.
  • I can't use user defined functions (UDFs)
  • I can't use user defined types (UDTs)

A correct answer will

  • Be written for Teradata or compatible
  • Be more efficient than my answer
  • Respect the restrictions I mention above

Sample Input:

Sample Input

Desired Output:

Desired Output


Solution

  • You need a recursive apporach, but your WITH RECURSIVE creates a humongous intermediate result, which leads to no more spool.

    For a similar process I used following approach (originally USING A WHILE-loop in a Stored Procedure):

    CREATE MULTISET VOLATILE TABLE vt_tmp, NO Log  AS
     (
      SELECT group_id, category_1, category_2, 
         -- assign a unique number to 
         Dense_Rank() Over (ORDER BY group_id, category_1) AS rnk
    
      -- remove when you source data is unique
      GROUP BY 1,2,3 -- same result as a DISTINCT, but processed before DENSE_RANK
      FROM match_detail 
     )
    WITH DATA
    PRIMARY INDEX (category_2)
    ON COMMIT PRESERVE ROWS;
    

    Now repeat the following update until 0 rows processed:

    -- find matching categories and assign them a common number    
    UPDATE vt_tmp FROM
     ( SELECT e2.group_id, e2.category_1, Min(e1.rnk) AS minrnk
       FROM vt_tmp e1 JOIN vt_tmp e2
       ON e1.category_2 = e2.category_2
       AND e1.rnk < e2.rnk
       GROUP BY e2.group_id, e2.category_1
     ) x
    SET rnk = minrnk
    WHERE 
      vt_tmp.group_id = x.group_id
    AND vt_tmp.category_1 = x.category_1
    ;
    

    To get the related categories you finally need:

    SELECT group_id, category_1 AS category, rnk AS related_categories
    FROM vt_tmp
    UNION
    SELECT group_id, category_2, rnk 
    FROM vt_tmp
    

    For an exact match of your expected result you need to add a DENSE_RANK:

    SELECT group_id, category, Dense_Rank() Over (PARTITION BY group_id ORDER BY related_categories)
    FROM
     (
       SELECT group_id, category_1 AS category, rnk AS related_categories
       FROM vt_tmp
       UNION
       SELECT group_id, category_2, rnk 
       FROM vt_tmp
     ) AS dt