Search code examples
sqlpostgresqlfull-text-searchtsvector

Full text search returning too many irrelevant results and causing poor performance


I'm using the full text search feature from Postgres and for the most part it works fine.

I have a column in my database table called documentFts that is basically the ts_vector version of the body field, which is a text column, and that's indexed with GIN index.

Here's my query:

select
      count(*) OVER() AS full_count,
      id,
      url, 
      (("urlScore" / 100) + ts_rank("documentFts", websearch_to_tsquery($4, $1))) as "finalScore",
      ts_headline(\'english_unaccent\', title, websearch_to_tsquery($4, $1)) as title,
      ts_headline(\'english_unaccent\', body, websearch_to_tsquery($4, $1)) as body,
      "possibleEncoding",
      "responseYear"
    from "Entries"
    where 
      "language" = $3 and 
      "documentFts" @@ websearch_to_tsquery($4, $1)
    order by (("urlScore" / 100) + ts_rank("documentFts", websearch_to_tsquery($4, $1))) desc limit 20 offset $2;

The dictionary is english_unaccent because I created one based on english that uses the unaccent extension by using:

CREATE TEXT SEARCH CONFIGURATION english_unaccent (
  COPY = english
);

ALTER TEXT SEARCH CONFIGURATION english_unaccent
  ALTER MAPPING FOR hword, hword_part, word WITH unaccent,
  english_stem;

I did the same for other languages.

And then I did this to my Entries db:

ALTER TABLE "Entries"
  ADD COLUMN "documentFts" tsvector;

UPDATE
  "Entries"
SET
  "documentFts" = (setweight(to_tsvector('english_unaccent', coalesce(title)), 'A') || setweight(to_tsvector('english_unaccent', coalesce(body)), 'C'))
WHERE
  "language" = 'english';

I have a column in my table with the language of the entry, hence the "language" = 'english'.

So, the problem I'm having is that for words like animal, anime or animation, they all go into the vector as anim, which means that if I search for any of those words I get results with all of those variations.

That returns a HUGE dataset that causes the query to be quite slow compared to searches that return fewer items. And also, if I search for Anime, my first results contain Animal, Animated and the first result that has the word Anime is the 12th one.

Shouldn't animation be transformed to animat in the vector and animal just be animal as the other variations for it are animals or animalia?

I've been searching for a solution to this without much luck, is there any way I can improve this, I'm happy to install extensions, reindex the column or whatever.


Solution

  • There are so many little details to this. The best solution depends on the exact situation and exact requirements.

    Two simple options:

    Simple tweak 1

    If you want to sort rows where title or body have a word starting with 'Anime' (exactly) in it, matched case-insensitively, add an ORDER BY expression like:

    ORDER  BY unaccent(concat_ws(' ', title, body) !~* ('\m' || f_regexp_escape($4))
            , (("urlScore" / 100) + ts_rank("documentFts", websearch_to_tsquery($4, $1))) DESC
    
    

    Where the auxiliary function f_regexp_escape() escapes special regexp characters and is defined here:

    That expression is rather expensive, but since it's only applied to filtered results, the effect is limited. You may have to fine-tune, as other search terms present other difficulties. Think of 'body' / 'bodies' stemming to 'bodi' ...

    Simple tweak 2

    To remove English stemming completely, base yours on the 'simple' TEXT SEARCH CONFIGURATION:

    CREATE TEXT SEARCH CONFIGURATION simple_unaccent (
      COPY = simple
    );
    

    Etc.

    Then the actual language of the text is irrelevant.The index gets substantially bigger, and the search is done on literal spellings. You can now widen the search with prefix matching like:

    WHERE  "documentFts" @@ to_tsquery('simple_unaccent', ($1 || ':*')
    

    Again, you'll have to fine-tune. The simple example only works for single-word patterns. And I doubt you want to get rid of stemming altogether. Probably too radical.

    See:

    Proper solution: Synonym dictionary

    You need access to the installation drive of the Postgres server for this. So typically not possible with most hosted services.

    To overrule some of the stemmer decisions, overrule with your own set of synonym(rule)s. Create a mapping file in $SHAREDIR/tsearch_data/my_synonyms.syn. That's /usr/share/postgresql/13/tsearch_data/my_synonyms.syn in my Linux installation:

    Let it contain (case insensitive by default):

    anime anime
    

    Then:

    CREATE TEXT SEARCH DICTIONARY my_synonym (
        TEMPLATE = synonym,
        SYNONYMS = my_synonyms
    );
    

    There is a chapter with instructions in the manual. One quote:

    A synonym dictionary can be used to overcome linguistic problems, for example, to prevent an English stemmer dictionary from reducing the word “Paris” to “pari”. It is enough to have a Paris paris line in the synonym dictionary and put it before the english_stem dictionary.

    Then:

    CREATE TEXT SEARCH CONFIGURATION my_english_unaccent (
      COPY = english
    );
    
    ALTER TEXT SEARCH CONFIGURATION my_english_unaccent
      ALTER MAPPING FOR hword, hword_part, word
      WITH unaccent, my_synonym, english_stem;   -- added my_synonym!
    

    You have to update your column "documentFts" with my_english_unaccent. While being at it, use a proper lower-case column name like document_fts, and consider a GENERATED column. See:

    Now, searching for Anime (or ánime, for that matter) won't find animal any more. And searching for animal won't find Anime.