Search code examples
mysqlindexingfull-text-searchresultsetfulltext-index

Mysql show different results between 2 servers with the same data for FULLTEXT index


I have a local env with Mysql 5.7.19 (on windows 10 Pro French) and a prod server with Mysql 5.7.31 (Ubuntu Linux 16.04.5).

The data is synchronised from Prod to local ENV. I have a fullText index on 3 columns and a simple request :

SELECT MATCH (r0_.title, r0_.description, r0_.tag_text)
       AGAINST ('+poulet* +carotte*' IN BOOLEAN MODE) AS sclr_0,
       r0_.id AS id_1, r0_.title AS title_2, r0_.description AS description_3,
       r0_.url AS url_4, r0_.image AS image_5, r0_.slug AS slug_6, r0_.click AS click_7, r0_.tag_text AS tag_text_8, r0_.active AS active_9, r0_.created_at AS created_at_10, r0_.updated_at AS updated_at_11
    FROM recipe r0_
    WHERE r0_.active = 1
    HAVING sclr_0 >= 1
    ORDER BY sclr_0 DESC;

On local env => 98 results
On prod env => 0 result

Create schema :

CREATE TABLE `recipe` (
  `id` int(11) NOT NULL,
  `blog_id` int(11) NOT NULL,
  `title` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `description` longtext COLLATE utf8mb4_unicode_ci NOT NULL,
  `url` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `image` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `slug` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `click` smallint(6) NOT NULL,
  `created_at` datetime NOT NULL,
  `updated_at` datetime NOT NULL,
  `tag_text` varchar(1000) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `active` tinyint(1) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

ALTER TABLE `recipe`
  ADD PRIMARY KEY (`id`),
  ADD KEY `IDX_DA88B137DAE07E97` (`blog_id`),
  ADD KEY `IDX_DA88B1374B1EFC02` (`active`),
  ADD KEY `IDX_DA88B1378B8E8428` (`created_at`);
ALTER TABLE `recipe` ADD FULLTEXT KEY `IDX_DA88B1372B36786B6DE44026D5841871`
        (`title`,`description`,`tag_text`);

More data on Prod actually because new recipe but no result.


Solution

  • The relevancy score is calculated based on the content of the complete table:

    InnoDB uses a variation of the “term frequency-inverse document frequency” (TF-IDF) weighting system to rank a document's relevance for a given full-text search query. The TF-IDF weighting is based on how frequently a word appears in a document, offset by how frequently the word appears in all documents in the collection. In other words, the more frequently a word appears in a document, and the less frequently the word appears in the document collection, the higher the document is ranked.

    "Document" here means a single row, "document collection" means all rows. The manual contains the exact formula, but the important thing is: since you have more recipes on prod compared to local, the score will be different. If you e.g. added more recipes containing carottes, the score will go down, if you added recipes that don't contain your search terms, the score will go up.

    This is completely independent of how good the single result on its own actually is! A chicken carrot stew is a good fit to your search, but the absolute score will vary if you also have a recipe for carrot cake in your database or not.

    So the absolute value of the score itself is usually not a good criteria to filter on, e.g. with your where score > 1, but as a way to order the results you get, e.g. with order by score desc, usually including a limit.

    It is unlikely that you will find a good absolute minimum value for your score (except for 0) that would make sense in general:

    • if you find a nice value for now, it may be too high in 2 weeks if carrots become more popular and you add recipes for those (similar to your experience on prod). Or vice versa, if you used a specific value of 1 that will get rid of unwanted, lower score results, they might reappear in 2 weeks if you add carrot-unrelated recipes - not because those unwanted results are suddenly better, but because they became rarer.
    • if you found a nice value that fits for searches containing carrots, it may not be a good value for other search term. If you e.g. search for a frequent ingredient, maybe "sugar", you will still expect results that contain "sugar" even though the absolute value will be low - just because it is used more often than carrots.

    But a recipe that uses the word sugar very often (as it may be an important ingredient, maybe a recipe for caramel) will have a higher score than those that only mentions it once ("add some sugar"), so you can use the value to order your results relative to each other.