I have a table of images with sentence captions. Given a new sentence I want to find the images that best match it based on how close the new sentence is to the stored old sentences.
I know that I can use the @@
operator with a to_tsquery
but tsquery
accepts specific words as queries.
One problem is I don't know how to convert the given sentence into a meaningful query. The sentence may have punctuation and numbers.
However, I also feel that some kind of cosine similarity thing is what I need but I don't know how to get that out of PostgresQL. I am using the latest GA version and am happy to use the development version if that would solve my problem.
You could use plainto_tsquery()
to (per documentation) ...
produce
tsquery
ignoring punctuation
SELECT plainto_tsquery('english', 'Sentence: with irrelevant words (and punctuation) in it.')
plainto_tsquery
------------------
'sentenc' & 'irrelev' & 'word' & 'punctuat'
Use it like:
SELECT *
FROM tbl
WHERE to_tsvector('english', sentence) @@ plainto_tsquery('english', 'My new sentence');
But that is still rather strict and only provides very limited tolerance for similarity.
Might be better suited to search for similarity, even overcome typos to some degree.
Install the additional module pg_trgm
, create a GiST index and use the similarity operator %
in a nearest neighbour search:
Basically, with a trigram GiST index on sentence
:
-- SELECT set_limit(0.3); -- adjust tolerance if needed
SELECT *
FROM tbl
WHERE sentence % 'My new sentence'
ORDER BY sentence <-> 'My new sentence'
LIMIT 10;
More:
You can even combine FTS and trigram similarity: