Search code examples
amazon-redshiftunpivot

how to unpivot large AWS Redshift table


I am trying to run a query against a table in AWS Redshift (i.e., postgresql). Below is a simplified definition of the table:

CREATE TABLE some_schema.some_table (
    row_id int
    ,productid_level1 char(1)
    ,productid_level2 char(1)
    ,productid_level3 char(1)
)
;

INSERT INTO some_schema.some_table
VALUES
    (1, a, b, c)
    ,(2, d, c, e)
    ,(3, c, f, g)
    ,(4, e, h, i)
    ,(5, f, j, k)
    ,(6, g, l, m)
;

I need to return a de-duped, single column table of a given productid and all of its children. "Children" means any productid that has "level" higher than the given product (for a given row) and also its grandchildren.

For example, for productid 'c', I expect to return...

  • 'c' (because it's found in rows 1, 2, and 3)
  • 'e' (because it's a child of 'c' in row 2)
  • 'f' and 'g' (because they're children of 'c' in row 3)
  • 'h' and 'i' (because they're children of 'e' in row 4)
  • 'j' and 'k' (because they're children of 'f' in row 5)
  • and 'l' and 'm' (because they're children of 'g' in row 6)

Visually, I expect to return the following:

productid
---------
c
e
f
g
h
i
j
k
l
m

The actual table has about 3M rows and has about 20 "levels".

I think there are 2 parts to this query -- (1) a recursive CTE to build out the hierarchy and (2) an unpivot operation.

I have not attempted (1) yet. For (2), I have tried a query like the following, but it hasn't returned even after 3 minutes. As this will be used for an operational report, I need it to return in < 15 seconds.

select
    b.productid
    ,b.product_level
from
    some_schema.some_table as a
cross join lateral (
    values
    (a.productid_level1, 1)
    ,(a.productid_level2, 2)
    ...
    ,(a.productid_level20, 20)
) as b(productid, product_level)

How can I write the query to achieve (1) and (2) and be very performant?


Solution

  • I would avoid using the term Hierarchy, as that "usually" implies any node having a single parent at most.

    I admit I'm lost as to the nature of the graph/network this table represents. But you might benefit from a little brute force and code repetition.

    Whatever eventually works for you, I think you'll need to persist/materialise/cache the results, as repeating this at report time is unlikely to ever be a good idea.

    I'm a data engineer by trade, and I'm sure they have good reasons for what they've done (or, like me, they maybe screwed up). Either way, there are many good reasons to ask them to materialise the graph in more than just one form, each suited to different use cases. So, asking them for a traditional adjacency list, as well as the table you already have, is a reasonable request. Or, at the very least, a good starting point for a conversation.

    So, a brute force approach?

    WITH
      adjacency AS
    (
      SELECT level01, level02 FROM some_table WHERE level02 IS NOT NULL
      UNION
      SELECT level02, level03 FROM some_table WHERE level03 IS NOT NULL
      UNION
      ...
      UNION
      SELECT level19, level20 FROM some_table WHERE level20 IS NOT NULL
    )
    

    The WHERE clause elimates any sparse data before it enters the map.

    The UNION (without ALL) ensures duplicate links are eliminated. You should also test UNION ALL and then wrapping a SELECT DISTINCT around it (or similar).

    Then you can use that adjacency list in the usual recursive walk, to find all children of a given node. (Taking care that there aren't any cyclic paths.)