Search code examples
sqlpostgresqlnlprelational-division

Efficient query to find rows matching a rule based on multiple rows


I have a PostgreSQL table of Czech words (>1M rows) with a column named "word" [text] and I want to find all words with the same declination (see Czech declination) based on the word ending.

E.g. I want to find all words that end with "e" (e.g. kuře), but for which also exists word forms that end with "ete" (e.g. kuřete) and also "etem" (e.g. kuřetem) and also "eti" (e.g. kuřeti). For each word exists approx. 14 word forms.

What is an efficient way (SQL query) to find all words that match the rule?


Solution

  • This is a case of relational division.

    Assuming a table of UNIQUE words like:

    CREATE TABLE words (word text PRIMARY KEY);
    

    This should be among the fastest possible solutions:

    SELECT w0.stem
    FROM  (
       SELECT left(word, -4) AS stem  -- -4 = length('etem')
       FROM   words
       WHERE  word LIKE '%etem'  -- pick the most selective ending to get started
       ) w0
    JOIN   words w1 ON w1.word = stem || 'eti'
    JOIN   words w2 ON w2.word = stem || 'ete'
    JOIN   words w3 ON w3.word = stem || 'e';
    

    Finds all word stems that appear with all the given endings. More words starting with the same stem and different endings do not disqualify!

    If you have to check for many endings (14?), it may get tedious to spell it all out. Shorter code, typically slower:

    SELECT w0.stem
    FROM  (
       SELECT left(word, -4) AS stem
       FROM   words
       WHERE  word LIKE '%etem'  -- pick the most selective ending to get started
       ) w0
    CROSS  JOIN unnest ('{eti,ete,e}'::text[]) x(dec)  -- all other in an array
    JOIN   words w1 ON w1.word = w0.stem || x.dec
    GROUP  BY w0.stem
    HAVING count(*) = 3;  -- = cardinality('{eti,ete,e}'::text[])
    

    db<>fiddle here

    Related:

    Text search operators and indexes might be of interest. But you need a Czech stemmer first, which is not included in the standard Postgres distribution. Related: