Search code examples
databasesqlitecsvdirectorystructure

SQLite: Create Directory Structure Table from A List Of Paths


I want to create a directory structure table as described in this question where:

Directory = "Primary Key" id field, typically an integer
Directory_Parent = "Foreign Key" id field, which points to the id of another Directory in the same table
Value = string containing the directory/folder name

Given Tree/Fruit/Apples/

Directory | Directory_Parent | Value
0           null               Root
1           0                  Tree
2           1                  Fruit
3           2                  Apples

A Root folder has been created at Primary Key 0 with a null parent.


My paths are being imported from CSV and currently are in a table with 2 columns:

 FileID  Path
      1  videos/gopro/father/mov001.mp4
      2  videos/gopro/father/mov002.mp4
      3  pictures/family/father/Oldman.jpg
      4  pictures/family/father/Oldman2.jpg
      5  documents/legal/father/estate/will.doc
      6  documents/legal/father/estate/will2.doc
      7  documents/legal/father/estate/newyork/albany/will.doc
      8  video/gopro/father/newyork/albany/holiday/christmas/2002/mov001.mp4
      9  pictures/family/father/newyork/albany/holiday/christmas/2002/july/Oldman.jpg
      10 pictures/family/father/newyork/albany/holiday/christmas/2002/june/Oldman2.jpg

This table contains 1 million file entries.
What's a fast & optimized way to parse this data and move the folder structure into a new table as described above?

In this demo the folders are separated by "/" and moved into a new column, if that helps.


Solution

  • SQL lacks the flexibility and tools of a programming language, which would provide a fast and optimized solution to this problem.

    Moreover SQLite is the poorest of the databases when it comes to string manipulation because it does not support a function like SQL Server's STRING_SPLIT() or MySql's SUBSTRING_INDEX() which would be very helpful.

    Nevertheless the problem is interesting and I gave it a shot.

    I create the table dir_struct with this statement:

    CREATE TABLE dir_struct (
      Directory INTEGER PRIMARY KEY, 
      Directory_Parent INTEGER REFERENCES dir_struct(Directory), 
      Value TEXT
    );
    

    and I insert the 'root' row:

    INSERT INTO dir_struct (Directory, Directory_Parent, Value) VALUES (0, null, 'root');
    

    Also, I turn OFF foreign key enforcement with:

    PRAGMA foreign_keys = OFF;
    

    although it is by default turned off, just in case.

    First you need a recursive CTE that splits the paths to individual directories (much like the answer on your previous question).
    Then in the 2nd CTE, with conditional aggregation, each directory goes in its own column (with the limitation of max 10 directories).
    The 3d CTE removes duplicates and the 4th CTE assigns, with ROW_NUMBER() window function, unique ids to the directories.
    Finally with a self join of the results of the 4th CTE, the rows are inserted in the table:

    WITH 
      split AS (
        SELECT 0 idx,
               FileDataID,
               SUBSTR(SUBSTR(Path, 1), 1, INSTR(SUBSTR(Path, 1), '/') - 1) item,
               SUBSTR(SUBSTR(Path, 1), INSTR(SUBSTR(Path, 1), '/') + 1) value
        FROM listfile
        UNION ALL
        SELECT idx + 1,
               FileDataID,
               SUBSTR(value, 1, INSTR(value, '/') - 1),
               SUBSTR(value, INSTR(value, '/') + 1)
        FROM split
        WHERE value LIKE '%_/_%' 
      ),
      cols AS (
        SELECT DISTINCT
               MAX(CASE WHEN idx = 0 THEN item END) path0,
               MAX(CASE WHEN idx = 1 THEN item END) path1,
               MAX(CASE WHEN idx = 2 THEN item END) path2,
               MAX(CASE WHEN idx = 3 THEN item END) path3,
               MAX(CASE WHEN idx = 4 THEN item END) path4,
               MAX(CASE WHEN idx = 5 THEN item END) path5,
               MAX(CASE WHEN idx = 6 THEN item END) path6,
               MAX(CASE WHEN idx = 7 THEN item END) path7,
               MAX(CASE WHEN idx = 8 THEN item END) path8,
               MAX(CASE WHEN idx = 9 THEN item END) path9
        FROM split
        GROUP BY FileDataID
      ),
      paths AS (
        SELECT path0, path1, path2, path3, path4, path5, path6, path7, path8, path9 FROM cols UNION
        SELECT path0, path1, path2, path3, path4, path5, path6, path7, path8, null FROM cols UNION
        SELECT path0, path1, path2, path3, path4, path5, path6, path7, null, null FROM cols UNION
        SELECT path0, path1, path2, path3, path4, path5, path6, null, null, null FROM cols UNION
        SELECT path0, path1, path2, path3, path4, path5, null, null, null, null FROM cols UNION
        SELECT path0, path1, path2, path3, path4, null, null, null, null, null FROM cols UNION
        SELECT path0, path1, path2, path3, null, null, null, null, null, null FROM cols UNION
        SELECT path0, path1, path2, null, null, null, null, null, null, null FROM cols UNION
        SELECT path0, path1, null, null, null, null, null, null, null, null FROM cols UNION
        SELECT path0, null, null, null, null, null, null, null, null, null FROM cols
      ), 
      ids AS (
        SELECT *, 
               ROW_NUMBER() OVER (ORDER BY path0, path1, path2, path3, path4, path5, path6, path7, path8, path9) nr,
               COALESCE(path9, path8, path7, path6, path5, path4, path3, path2, path1, path0) last_child,
               path0 || COALESCE('/' || path1, '') ||
                        COALESCE('/' || path2, '') ||
                        COALESCE('/' || path3, '') ||
                        COALESCE('/' || path4, '') ||
                        COALESCE('/' || path5, '') ||
                        COALESCE('/' || path6, '') ||
                        COALESCE('/' || path7, '') ||
                        COALESCE('/' || path8, '') ||
                        COALESCE('/' || path9, '') full_path
        FROM paths       
      )
    INSERT INTO dir_struct(Directory, Directory_Parent, Value)
    SELECT i1.nr, COALESCE(i2.nr, 0), i1.last_child
    FROM ids i1 LEFT JOIN ids i2
    ON i1.full_path = i2.full_path || '/' || i1.last_child
    

    In my test dataset which consists of 187365 rows, the rows were inserted in (average) 9.5-10 minutes, which will be much longer for your larger dataset.

    See the demo.

    What is more interesting is that with simpler code, the performance is worse (but you can also test it):

    WITH 
      split AS (
        SELECT Path,
               0 parent_len,
               SUBSTR(SUBSTR(Path, 1), 1, INSTR(SUBSTR(Path, 1), '/') - 1) item,
               SUBSTR(SUBSTR(Path, 1), INSTR(SUBSTR(Path, 1), '/') + 1) value
        FROM listfile
        UNION ALL
        SELECT Path,
               parent_len + LENGTH(item) + 1, 
               SUBSTR(value, 1, INSTR(value, '/') - 1),
               SUBSTR(value, INSTR(value, '/') + 1)
        FROM split
        WHERE value LIKE '%_/_%' 
      ), 
      row_numbers AS (
        SELECT parent_path, item, 
               ROW_NUMBER() OVER (ORDER BY parent_path, item) rn
        FROM (SELECT DISTINCT SUBSTR(Path, 1, parent_len) parent_path, item FROM split)       
      )
    INSERT INTO dir_struct(Directory, Directory_Parent, Value)  
    SELECT r1.rn, COALESCE(r2.rn, 0) rn_parent, r1.item 
    FROM row_numbers r1 LEFT JOIN row_numbers r2
    ON r1.parent_path = r2.parent_path || r2.item || '/'
    

    The ids assigned to the directories by this query are different than the ones assigned by the first solution, but they are correct and unique.

    This runs in (average) 14-15 minutes.
    See the demo.

    The conclusion is that if this is a one time thing, maybe you can use it, but I would not recommend it as a solution to this requirement.