postgresql full-text-search tf-idf cosine-similarity

PostgreSQL: Find sentences closest to a given sentence

I have a table of images with sentence captions. Given a new sentence I want to find the images that best match it based on how close the new sentence is to the stored old sentences.

I know that I can use the @@ operator with a to_tsquery but tsquery accepts specific words as queries.

One problem is I don't know how to convert the given sentence into a meaningful query. The sentence may have punctuation and numbers.

However, I also feel that some kind of cosine similarity thing is what I need but I don't know how to get that out of PostgresQL. I am using the latest GA version and am happy to use the development version if that would solve my problem.

Solution

Full Text Search (FTS)

You could use plainto_tsquery() to (per documentation) ...

produce tsquery ignoring punctuation

SELECT plainto_tsquery('english', 'Sentence: with irrelevant words (and punctuation) in it.')

 plainto_tsquery
------------------
 'sentenc' & 'irrelev' & 'word' & 'punctuat'

Use it like:

SELECT *
FROM   tbl
WHERE  to_tsvector('english', sentence) @@ plainto_tsquery('english', 'My new sentence');

But that is still rather strict and only provides very limited tolerance for similarity.

Trigram similarity

Might be better suited to search for similarity, even overcome typos to some degree.

Install the additional module pg_trgm, create a GiST index and use the similarity operator % in a nearest neighbour search:

Basically, with a trigram GiST index on sentence:

-- SELECT set_limit(0.3);  -- adjust tolerance if needed

SELECT *
FROM   tbl
WHERE  sentence % 'My new sentence'
ORDER  BY sentence <-> 'My new sentence'
LIMIT  10;

Combine both

You can even combine FTS and trigram similarity:

PostgreSQL FTS and Trigram-similarity Query Optimization