Search code examples
sqlsql-serversqlperformancesql-server-performance

SQL: Find missing hierarchy Folders (Paths) in a table


I have a table which contains Folders Paths. I need to find all the "gaps" between those folders in the hierarchy. I mean that, if the table contains these 3 folders:

'A'
'A\B\C'
'A\B\C\D\E\F\G'

I need to find the following missing folders in the hierarchy:

'A\B'
'A\B\C\D'
'A\B\C\D\E'
'A\B\C\D\E\F'

This table contains more than 250,000 records of folders, so we seek for the most efficient way to do so, otherwise the script will be stuck for long time, time we don't have.

Comment: I don't have list of all folders. What I have are the "root" folders and the "leafs" folders which I need to find the "gaps" between them in the hierarchy.

Second comment: The table can contains more than one hierarchy and we need to find the "gaps" in all of the hierarchies. For that matter, There are 2 another int columns: "DirID" and "BaseDirID". The "DirID" column is the id column in our table. The "BaseDirID" contains the id of the first folder in the hierarchy. So all the folders (paths) from the same hierarchy share the same value in this column. Sample data for example:

Example sample data

DirID   BaseDirID   DisplayPath
1   1   'A'
2   1   'A\B\C'
3   1   'A\B\C\D\E'
4   4   'U'
5   4   'U\V\W'
6   4   'U\V\W\X\Y'

So we need to find the following data:

Expected Results

BaseDirID   DisplayPath
1   'A\B'
1   'A\B\C\D'
4   'U\V'
4   'U\V\W\X'

Thanks in advance.


Solution

  • Here is one approach using Recursive CTE and split string function

    ;WITH existing_hierachies
         AS (SELECT DirID,
                    BaseDirID,
                    DisplayPath
             FROM   (VALUES (1,1,'A' ),
                            (2,1,'A\B\C' ),
                            (3,1,'A\B\C\D\E' ),
                            (4,4,'U' ),
                            (5,4,'U\V\W' ),
                            (6,4,'U\V\W\X\Y' )) tc (DirID, BaseDirID, DisplayPath) ),
         folders_list
         AS (SELECT ItemNumber,
                    item fol,
                    BaseDirID
             FROM   (SELECT row_number()over(partition by BaseDirID order by Len(DisplayPath) DESC)rn,*
                     FROM   existing_hierachies) a
                     CROSS apply dbo.[Delimitedsplit8k](DisplayPath, '\')
                     Where Rn = 1),
         rec_cte
         AS (SELECT *,
                    Cast(fol AS VARCHAR(4000))AS hierar
             FROM   folders_list
             WHERE  ItemNumber = 1
             UNION ALL
             SELECT d.*,
                    Cast(rc.hierar + '\' + d.fol AS VARCHAR(4000))
             FROM   rec_cte rc
                    JOIN folders_list d
                      ON rc.BaseDirID = d.BaseDirID
                      AND d.ItemNumber = rc.ItemNumber + 1)
    SELECT rc.BaseDirID,
           rc.hierar AS Missing_Hierarchies
    FROM   rec_cte rc
    WHERE  NOT EXISTS (SELECT 1
                       FROM   existing_hierachies eh 
                       WHERE  eh.BaseDirID = rc.BaseDirID 
                         AND  eh.DisplayPath = rc.hierar) 
    Order by rc.BaseDirID
    

    Result :

    +-----------+---------------------+
    | BaseDirID | Missing_Hierarchies |
    +-----------+---------------------+
    |         1 | A\B                 |
    |         1 | A\B\C\D             |
    |         4 | U\V                 |
    |         4 | U\V\W\X             |
    +-----------+---------------------+
    

    Split string function code

    CREATE FUNCTION [dbo].[DelimitedSplit8K]
            (@pString VARCHAR(8000), @pDelimiter CHAR(1))
    RETURNS TABLE WITH SCHEMABINDING AS
     RETURN
    --===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
         -- enough to cover NVARCHAR(4000)
      WITH E1(N) AS (
                     SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
                     SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
                     SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                    ),                          --10E+1 or 10 rows
           E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
           E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
     cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
                         -- for both a performance gain and prevention of accidental "overruns"
                     SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
                    ),
    cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
                     SELECT 1 UNION ALL
                     SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
                    ),
    cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
                     SELECT s.N1,
                            ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)
                       FROM cteStart s
                    )
    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
     SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
            Item       = SUBSTRING(@pString, l.N1, l.L1)
       FROM cteLen l
    ;
    GO
    

    Referred from http://www.sqlservercentral.com/articles/Tally+Table/72993/