Search code examples
sqloracleconnect-byhierarchical-queryenterprisedb

Connect by query


I'm storing hierarchical data in a table. When a resource is accessed by its hierarchical path (grantParent/parent/resource), I need to locate the resource using a CONNECT BY query.

Note: SQL commands are exported from EnterpriseDB, but it should work in Oracle too.

Table structure:

CREATE TABLE resource_hierarchy
(
  resource_id character varying(100) NOT NULL,
  resource_type integer NOT NULL,
  resource_name character varying(100),
  parent_id character varying(100)
)
WITH (
  OIDS=FALSE
);

Data:

INSERT INTO "resource_hierarchy" (resource_id,resource_type,resource_name,parent_id) VALUES ('36d27991', 3, 'areaName',    'a616f392');
INSERT INTO "resource_hierarchy" (resource_id,resource_type,resource_name,parent_id) VALUES ('a616f392', 3, 'townName',    'fcc1ebb7');
INSERT INTO "resource_hierarchy" (resource_id,resource_type,resource_name,parent_id) VALUES ('fcc1ebb7', 2, 'stateName',   '8369cc88');
INSERT INTO "resource_hierarchy" (resource_id,resource_type,resource_name,parent_id) VALUES ('8369cc88', 5, 'countryName', null);

Now, when I receive a path like

countryName/stateName/townName/areaName

I'm executing a query like,

select LEVEL,* from resource_hierarchy
WHERE resource_name = (
            CASE LEVEL 
                WHEN 1 THEN 'areaName'
                WHEN 2 THEN 'townName'
                WHEN 3 THEN 'stateName'
                WHEN 4 THEN 'countryName'
                ELSE ''
            END
         )
 connect by prior parent_id = resource_id
 start with resource_name = 'areaName';

My expected results are:

LEVEL   resource_id resource_type   resource_name   parent_id
-------------------------------------------------------------
1       36d27991    3               areaName        a616f392
2       a616f392    3               townName        fcc1ebb7
3       fcc1ebb7    2               stateName       8369cc88
4       8369cc88    5               countryName     <null>

This query works fine, but I'm not sure if it would run faster, when my table is big like hundreds of thousands of entries.

Can you optimize this query for my requirement?

Edited:

EXPLAIN for the above query: I've defined two indices - one on resource_id (primary key) and another on parent_id

Sort  (cost=66.85..66.86 rows=1 width=694)
  Sort Key: connectby_cte.siblingssortcol
  CTE prior
    ->  Recursive Union  (cost=0.00..65.83 rows=31 width=151)
      ->  WindowAgg  (cost=0.00..3.12 rows=1 width=83)
        ->  Seq Scan on resource_hierarchy  (cost=0.00..3.11 rows=1 width=83)
              Filter: ((resource_name)::text = 'areaName'::text)
      ->  WindowAgg  (cost=0.33..6.21 rows=3 width=151)
        ->  Hash Join  (cost=0.33..6.15 rows=3 width=151)
              Hash Cond: ((resource_hierarchy_1.resource_id)::text = (prior.parent_id)::text)
              Join Filter: connectby_cyclecheck(prior.recursionpath, (resource_hierarchy_1.parent_id)::text)
              ->  Seq Scan on resource_hierarchy resource_hierarchy_1  (cost=0.00..2.89 rows=89 width=83)
              ->  Hash  (cost=0.20..0.20 rows=10 width=286)
                ->  WorkTable Scan on prior  (cost=0.00..0.20 rows=10 width=286)
  ->  CTE Scan on prior connectby_cte  (cost=0.00..1.01 rows=1 width=694)
    Filter: ((resource_name)::text = CASE level WHEN 1 THEN 'areaName'::text WHEN 2 THEN 'townName'::text WHEN 3 THEN 'stateName'::text WHEN 4 THEN 'countryName'::text ELSE ''::text END)

Solution

  • Disclaimer: My primary experience belongs to Oracle DBMS, so pay attention to details if applying solution to Postgres.


    Where clause applied after full hierarchy already built, therefore in original query database engine started retrieving data with specified resource_name at any level and building a full tree for each found record. Filtering occurs only on the next step.
    Documentation:

    1. Oracle selects the root row(s) of the hierarchy—those rows that satisfy the START WITH condition.

    2. Oracle selects the child rows of each root row. Each child row must satisfy the condition of the CONNECT BY condition with respect to one of the root rows.

    3. Oracle selects successive generations of child rows. Oracle first selects the children of the rows returned in step 2, and then the children of those children, and so on. Oracle always selects children by evaluating the CONNECT BY condition with respect to a current parent row.

    4. If the query contains a WHERE clause without a join, then Oracle eliminates all rows from the hierarchy that do not satisfy the condition of the WHERE clause. Oracle evaluates this condition for each row individually, rather than removing all the children of a row that does not satisfy the condition.

    To optimize this situation query must be changed as follows(hierarchy reversed to more natural top-down order):

    select 
      level, rh.* 
    from 
      resource_hierarchy rh
    start with 
      (resource_name = 'countryName')
      and 
      (parent_id is null) -- roots only
    connect by 
      prior resource_id = parent_id
      and          
      -- at each step get only required records
      resource_name = (
        case level 
          when 1 then 'countryName'
          when 2 then 'stateName'
          when 3 then 'townName'
          when 4 then 'areaName'
          else null
        end
      )
    

    Same query may be writed on the basis of CTE syntax (Oracle recursive subquery factoring).
    Following is a variant for PostgreSQL CTE, corrected according to @Karthik_Murugan suggestion:

    with RECURSIVE hierarchy_query(lvl, resource_id) as (
        select
          1               lvl, 
          rh.resource_id  resource_id
        from
          resource_hierarchy rh
        where
         (resource_name = 'countryName') and (parent_id is null) 
    
      union all
    
        select
          hq.lvl+1        lvl,
          rh.resource_id  resource_id
        from
          hierarchy_query    hq,
          resource_hierarchy rh
        where
          rh.parent_id = hq.resource_id
          and
          -- at each step get only required records
          resource_name = (
            case (hq.lvl + 1)
              when 2 then 'stateName'
              when 3 then 'townName'
              when 4 then 'areaName'
              else null
            end
          )
    )
    select
      hq.lvl, rh.*
    from
      hierarchy_query    hq,
      resource_hierarchy rh
    where
      rh.resource_id = hq.resource_id
    order by
      hq.lvl
    

    It's only half of the work because we need to help database engine to locate records by creating appropriate indexes.
    Query above contains two search actions:
    1. Locate records to start with;
    2. Choose records on each next level.

    For the first action, we need to index resource_name field and, possible, parent_id field.
    For the second action fields parent_id and resource_name must be indexed.

    create index X_RESOURCE_HIERARCHY_ROOT on RESOURCE_HIERARCHY (resource_name);
    create index X_RESOURCE_HIERARCHY_TREE on RESOURCE_HIERARCHY (parent_id, resource_name);
    

    Maybe creating only X_RESOURCE_HIERARCHY_TREE index is enough. It depends on characteristics of data stored in a table.

    P.S. String for each level can be constructed from full path by using substr and instr functions like in this example for Oracle:

    with prm as (
      select 
        '/countryName/stateName/townName/areaName/' location_path 
      from dual
    )
    select 
      substr(location_path,
        instr(location_path,'/',1,level)+1,
        instr(location_path,'/',1,level+1)-instr(location_path,'/',1,level)-1
      )          
    from prm connect by level < 7