I have a PostgreSQL table of Czech words (>1M rows) with a column named "word" [text] and I want to find all words with the same declination (see Czech declination) based on the word ending.
E.g. I want to find all words that end with "e" (e.g. kuře), but for which also exists word forms that end with "ete" (e.g. kuřete) and also "etem" (e.g. kuřetem) and also "eti" (e.g. kuřeti). For each word exists approx. 14 word forms.
What is an efficient way (SQL query) to find all words that match the rule?
This is a case of relational division.
Assuming a table of UNIQUE
words like:
CREATE TABLE words (word text PRIMARY KEY);
This should be among the fastest possible solutions:
SELECT w0.stem
FROM (
SELECT left(word, -4) AS stem -- -4 = length('etem')
FROM words
WHERE word LIKE '%etem' -- pick the most selective ending to get started
) w0
JOIN words w1 ON w1.word = stem || 'eti'
JOIN words w2 ON w2.word = stem || 'ete'
JOIN words w3 ON w3.word = stem || 'e';
Finds all word stems that appear with all the given endings. More words starting with the same stem and different endings do not disqualify!
If you have to check for many endings (14?), it may get tedious to spell it all out. Shorter code, typically slower:
SELECT w0.stem
FROM (
SELECT left(word, -4) AS stem
FROM words
WHERE word LIKE '%etem' -- pick the most selective ending to get started
) w0
CROSS JOIN unnest ('{eti,ete,e}'::text[]) x(dec) -- all other in an array
JOIN words w1 ON w1.word = w0.stem || x.dec
GROUP BY w0.stem
HAVING count(*) = 3; -- = cardinality('{eti,ete,e}'::text[])
db<>fiddle here
Related:
Text search operators and indexes might be of interest. But you need a Czech stemmer first, which is not included in the standard Postgres distribution. Related: