Tags: sql, postgresql, iteration, query-optimization

Replacing iterative INSERT from SELECT with recursion or functions to traverse paths in Postgres


In my schema a dataset has many cfiles and a cfile has one dataset. Each cfile also has a number of dynamic property values stored in jsonb.

Fiddle here

SELECT * FROM datasets;
 id |   name   
----+----------
  1 | Dataset1
  2 | Dataset2


SELECT * FROM cfiles WHERE dataset_id=1;
 id | dataset_id |            path             |                     property_values                      
----+------------+-----------------------------+----------------------------------------------------------
  1 |          1 | dir_i/file1.txt             | {"Project": "ProjW", "Sample Names": ["sampA", "sampB"]}
  2 |          1 | dir_i/dir_j/file2.txt       | {"Project": "ProjX", "Sample Names": ["sampA", "sampC"]}
  3 |          1 | dir_i/dir_j/dir_k/file3.txt | {"Project": "ProjY", "Sample Names": ["sampD"]}
  4 |          1 | dir_m/file4.txt             | {"Project": "ProjZ", "Sample Names": ["sampE"]}

Following from this SO question and fantastic answer, I have this query:

INSERT into agg_prop_vals(dataset_id, path, sample_names, projects)
  SELECT DISTINCT
  cfiles.dataset_id,
  '.' as path,
  -- ** path specific:
  -- 'dir_i/dir_j/' as path,
  h."Sample Names", h."Project"
  FROM (
    SELECT
    dataset_id,
    string_agg(DISTINCT "Sample Names", '; ' ORDER BY "Sample Names") as "Sample Names",
    string_agg(DISTINCT "Project", '; ' ORDER BY "Project") as "Project"
    FROM (
      SELECT
      cfiles.dataset_id as dataset_id,
      property_values ->> 'Project' as "Project",
      jsonb_array_elements_text(property_values -> 'Sample Names') as "Sample Names"
      FROM cfiles
      WHERE cfiles.dataset_id=1
      -- ** path specific:
      -- AND cfiles.path LIKE 'dir_i/dir_j/%'
    ) g GROUP BY dataset_id
  ) h
  JOIN cfiles ON (cfiles.dataset_id=h.dataset_id)
  WHERE cfiles.dataset_id=1
  ON CONFLICT (dataset_id, path)
  DO UPDATE SET
    sample_names = excluded.sample_names,
    projects = excluded.projects;

that produces a table of aggregated cfile property values for the specific dataset:

SELECT * FROM agg_prop_vals;
 dataset_id |        path        |           sample_names            |          projects          
------------+--------------------+-----------------------------------+----------------------------
          1 | .                  | sampA; sampB; sampC; sampD; sampE | ProjW; ProjX; ProjY; ProjZ

This is great for getting my aggregated values per dataset, but I would now also like to get them per dataset+path, so something like this:

SELECT * FROM agg_prop_vals;
 dataset_id |        path        |           sample_names            |          projects          
------------+--------------------+-----------------------------------+----------------------------
          1 | .                  | sampA; sampB; sampC; sampD; sampE | ProjW; ProjX; ProjY; ProjZ
          1 | dir_i/             | sampA; sampB; sampC; sampD        | ProjW; ProjX; ProjY
          1 | dir_i/dir_j/       | sampA; sampC; sampD               | ProjX; ProjY
          1 | dir_i/dir_j/dir_k/ | sampD                             | ProjY
          1 | dir_m/             | sampE                             | ProjZ

All the processing is done one dataset at a time, so I'm happy to iterate over datasets (meaning WHERE cfiles.dataset_id=1 can be ignored/treated as a constant for this example). The problem I'm having is iterating over paths.

I can run the same query above for each path in the dataset (e.g. uncommenting the ** path specific: lines), but when there are thousands of sub-paths in a single dataset this can take up to an hour, e.g.:

("SELECT DISTINCT SUBSTRING(path, '(.*\/).*') FROM cfiles WHERE dataset_id=1").each do |sub_path|
  aggregate_query(sub_path)
end

But this is also inefficient: instead of reusing the already-calculated aggregates of the sub-directories at each level, it re-runs the query over all the child cfiles at every level.

i.e. to calculate:

 dataset_id |        path        |           sample_names            |          projects          
------------+--------------------+-----------------------------------+----------------------------
          1 | .                  | sampA; sampB; sampC; sampD; sampE | ProjW; ProjX; ProjY; ProjZ

it should be adding the pre-computed aggregates of the top-level sub-directories:

 dataset_id |        path        |           sample_names            |          projects          
------------+--------------------+-----------------------------------+----------------------------
          1 | dir_i/             | sampA; sampB; sampC; sampD        | ProjW; ProjX; ProjY

plus:

 dataset_id |        path        |           sample_names            |          projects          
------------+--------------------+-----------------------------------+----------------------------
          1 | dir_m/             | sampE                             | ProjZ

rather than going through all the child cfiles again.
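
To make the reuse concrete, here is a hypothetical sketch (not something I have working) that would derive the '.' row from the two already-aggregated top-level rows instead of rescanning every child cfile; projects would work the same way:

-- hypothetical: rebuild '.' from pre-computed child aggregates,
-- splitting the stored '; '-separated strings back apart to dedupe
SELECT 1 AS dataset_id,
       '.' AS path,
       string_agg(DISTINCT s, '; ' ORDER BY s) AS sample_names
FROM agg_prop_vals,
     LATERAL unnest(string_to_array(sample_names, '; ')) AS s
WHERE dataset_id = 1
  AND path IN ('dir_i/', 'dir_m/');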

Is there any way I can replace the iteration with some kind of query or PL/pgSQL that uses recursion or functions to traverse up the directory paths and populate the agg_prop_vals table accordingly?

Other points:

  • I can't restructure/normalize the existing datasets or cfiles tables but I can alter the agg_prop_vals table and add additional tables
  • I don't necessarily need to upsert with the ON CONFLICT block - I can split this out in the app

Solution

  • I'll load the output of "find /lib /bin /etc" (about 27k lines) into a table...

    BEGIN;
    CREATE TABLE _files( path TEXT NOT NULL );
    \copy _files (path) from 'files.txt';
    CREATE TABLE files( 
      id SERIAL PRIMARY KEY, 
      path TEXT NOT NULL,
      dataset_id INTEGER NOT NULL,
      attrib TEXT[] NOT NULL
     );
    INSERT INTO files (path,dataset_id,attrib) SELECT path,n,ARRAY[RIGHT(path,1),RIGHT(path,2)]
     FROM _files CROSS JOIN (SELECT generate_series(1,10) n) n;
    COMMIT;
    VACUUM ANALYZE files;
    CREATE INDEX files_dataset ON files(dataset_id);
    

    I've added a generate_series to multiply the number of files by 10.
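
    Just to illustrate what that CROSS JOIN does (every path is repeated once per generated value; the VALUES list here is made up):

    SELECT n, p AS path
    FROM (VALUES ('/etc/hosts'), ('/bin/ls')) v(p)
    CROSS JOIN generate_series(1,3) n;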

    Column "attrib" contains two text values which will be your "samples".

    I'll assume there are no double slashes in the paths, and that none of the paths end with a slash. If that's not the case, you'll have to drop this into the query at the appropriate spot:

    regexp_replace( regexp_replace( path, '(//+)', '/', 'g' ), '/$', '')
    

    Then let's add a parent_path column. Postgres regexps are slow, so this will take a while.

    CREATE TEMPORARY TABLE fp AS
    SELECT *, regexp_replace( path, '/[^/]+$', '' ) AS parent_path 
    FROM files WHERE dataset_id=1;
    
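    If your paths did need that normalization, it could presumably be folded in while building fp, along these lines:

    -- hypothetical variant: collapse doubled slashes and strip trailing
    -- slashes before computing parent_path
    CREATE TEMPORARY TABLE fp AS
    SELECT id, dataset_id, attrib,
           clean_path AS path,
           regexp_replace( clean_path, '/[^/]+$', '' ) AS parent_path
    FROM (
      SELECT *, regexp_replace( regexp_replace( path, '(//+)', '/', 'g' ), '/$', '' ) AS clean_path
      FROM files WHERE dataset_id=1
    ) f;
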

    Side note: to model paths/trees in SQL you can use a parent_id, or just stick the whole path in a column. In the latter case an array works much better than a string, because it's easy to access the individual elements.
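
    For example, the array representation makes it trivial to grab any ancestor (just a sketch; the rest of this answer sticks with strings):

    -- path as TEXT[]: the parent is simply the array minus its last element
    SELECT path,
           string_to_array(path, '/') AS elems,
           (string_to_array(path, '/'))[1:cardinality(string_to_array(path, '/')) - 1] AS parent_elems
    FROM files
    LIMIT 3;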

    I've added an "attrib" column in the files table, of type TEXT[], which models the results of the query above that aggregates sample_names and projects. It's an array, because we'll have to take it apart later.
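
    In the real schema, a comparable array could presumably be derived from the jsonb column like so:

    -- hypothetical mapping from the question's cfiles to this demo's attrib
    SELECT dataset_id, path,
           ARRAY(SELECT jsonb_array_elements_text(property_values -> 'Sample Names')) AS attrib
    FROM cfiles
    WHERE dataset_id = 1;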

    So. Now we must build a tree of directories, including intermediate directories that contain no files directly. Those never show up among the parent_path values generated above, so they have to be generated with a recursive query. Because SQL is SQL, it doesn't start from the root: it starts from the full paths and works its way up in reverse.

    CREATE TEMPORARY TABLE dirs (
          path TEXT UNIQUE NOT NULL,
          parent_path TEXT NOT NULL,
          attrib1 TEXT[] NULL,
          attrib2 TEXT[] NULL );
    
    INSERT INTO dirs (path, parent_path)
    WITH RECURSIVE pdirs AS (SELECT * FROM
      (SELECT parent_path AS path,
              regexp_replace( parent_path, '/[^/]+$', '' ) AS parent_path FROM fp
      ) x1
     UNION  SELECT * FROM
      (SELECT parent_path AS path,
              regexp_replace( parent_path, '/[^/]+$', '' ) AS parent_path FROM pdirs
      ) x2 WHERE path != '' OR parent_path != ''
     ) SELECT * FROM pdirs ORDER BY path;
    
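    A quick sanity check that the file-less intermediate directories really made it in:

    -- directories that directly contain only subdirectories, no files
    SELECT d.path
    FROM dirs d
    WHERE NOT EXISTS (SELECT 1 FROM fp WHERE fp.parent_path = d.path)
    ORDER BY d.path;
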

    Now, for each row in table fp, explode the attrib array, remove duplicates, and reassemble the result into one array per parent_path. There are two ways to do this... the first one is faster but requires an index on the temporary table. So, let's use the second one.

    SELECT dirs.path, (SELECT array_agg(a) FROM (SELECT DISTINCT unnest(attrib) a FROM fp WHERE fp.parent_path=dirs.path) x) FROM dirs;
    
    SELECT parent_path, array_agg(DISTINCT att) FROM (SELECT parent_path, unnest(attrib) att FROM fp) x GROUP BY parent_path;
    
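    For the record, the index the first variant wants would presumably be:

    -- speeds up the correlated subquery on fp.parent_path = dirs.path
    CREATE INDEX ON fp (parent_path);
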

    Now, "just" do the same thing recursively on table dirs to propagate the attributes up the paths... Twice, once for samples and once for projects, because the recursive self-reference can't appear more than once in the recursive term. Postgres also refuses a recursive reference inside a subquery or under an aggregate in the recursive term, so the recursion below carries individual attribute values and the array aggregation happens only once, at the end, in the outer UPDATE...

    WITH RECURSIVE rdirs AS (
      -- base case: attach each file's attribute values to its parent directory
      SELECT fp.parent_path AS path, u.att
      FROM fp CROSS JOIN LATERAL unnest(fp.attrib) AS u(att)
    UNION
      -- recursive case: push each (path, attribute) pair one level up;
      -- UNION (not UNION ALL) drops duplicates, which also bounds the recursion
      SELECT dirs.parent_path, rdirs.att
      FROM rdirs
      JOIN dirs ON (dirs.path = rdirs.path)
      WHERE dirs.path != '' OR dirs.parent_path != ''
    )
    UPDATE dirs
    SET attrib1 = x.attrib
    FROM (SELECT path, array_agg(DISTINCT att) AS attrib
          FROM rdirs GROUP BY path) x
    WHERE dirs.path = x.path;
    

    So, you do it again with the projects column (changing column names accordingly), and temporary table dirs should contain the desired result!
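
    To connect this back to your agg_prop_vals table, the final load could look something like this (assuming attrib1 ends up holding the sample names and attrib2 the projects; array_agg(DISTINCT ...) happens to emit sorted elements, but add an explicit sort if you rely on the order):

    -- hypothetical: flatten the arrays back into the '; '-separated format
    INSERT INTO agg_prop_vals (dataset_id, path, sample_names, projects)
    SELECT 1, path || '/',
           array_to_string(attrib1, '; '),
           array_to_string(attrib2, '; ')
    FROM dirs
    WHERE attrib1 IS NOT NULL
    ON CONFLICT (dataset_id, path)
    DO UPDATE SET sample_names = excluded.sample_names,
                  projects = excluded.projects;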

    It would most likely be possible to do all this with only one query and no temporary tables, if you like the challenge!