Search code examples
postgresqlindexingfile-processing

Searching DB table to determine unseen file names from list


I am processing flat files from disk and need to ensure that I never process the same file twice. The filename of every processed file is stored in a postgresql DB and at the next iteration I need to determine the unseen files on disk and process them, ie. I need to determine the set difference of the filenames on disk and the filenames in the DB.

Currently my approach is to create a CTE from the filenames on disk and join that to the table of seen filenames. The list of files on disk is large and constantly changing, and processing is slowing down.

This is the current query:

WITH input(filename) AS (VALUES ${filenames.joinToString { "(?)" }})
SELECT input.filename FROM input 
LEFT JOIN my_table pm ON input.filename ILIKE pm.filename
WHERE pm.filename IS NULL

${filenames.joinToString { "(?)" }} expands to something like (?), (?), (?), depending on the number of filename parameters.

What can I do to speed up this process?

One thing that I have to do is add an index on the filename column. What kind of index is the correct choice?


Solution

  • Since you're using ILIKE, I wouldn't put an index on pm.filename, but on LOWER(pm.filename). This should allow you to remove ILIKE in favour of the more performant LIKE. This also means you can just use a simple B-tree index, as it works fine with LIKE. LIKE is useful if you use wildcards, but if you don't, just use normal =-equality.

    Finally, there is a good chance that the query optimiser already does a lot with the query, but I suggest you look at the EXPLAIN (ANALYSE) output of this query. I have some suggestions for improvement, but no idea on whether they will help or they will all be boiled down to the same query plan. That's completely up to you!


    This takes the result of the first query first list and removes any matches from the result of the second query. The downside is that the returned filenames are lowercased.

    SELECT LOWER(filename)
    FROM (VALUES ${filenames.joinToString { "(?)" }}) AS input(filename)
    EXCEPT ALL (SELECT LOWER(filename) FROM my_table pm)
    

    This query doesn't have this drawback, it just returns all filenames that do not have a match in my_table.

    SELECT filename
    FROM (VALUES ${filenames.joinToString { "(?)" }}) AS input(filename)
    WHERE NOT EXISTS (
      SELECT
      FROM my_table pm
      WHERE LOWER(pm.filename) = LOWER(input.filename)
    )
    

    The last query is probably equivalent to this one, but I'll add it for completeness.

    SELECT filename
    FROM (VALUES ${filenames.joinToString { "(?)" }}) AS input(filename)
    WHERE LOWER(filename) NOT IN (
      SELECT LOWER(pm.filename)
      FROM my_table pm
    )