Tags: mysql, sql, full-text-search, innodb

MySQL InnoDB FULLTEXT search rankings for phrase searches


I'm running FULLTEXT searches against an InnoDB table, looking for phrases rather than separate keywords, e.g. searching for "foo bar" as a two-word phrase rather than for "foo" and "bar" individually.

Here is my test data:

+----+-------------------------------------------------------------------------------+
| id | content                                                                       |
+----+-------------------------------------------------------------------------------+
|  1 | example foo text bar                                                          |
|  2 | default value foo foo server                                                  |
|  3 | default value foo foo server bar foo test                                     |
|  4 | process foo bar potato integer text bar bar content foo foo value bar foo foo |
|  5 | foo bar demo string value foo bar music foo bar most foo bar                  |
+----+-------------------------------------------------------------------------------+

And here is my test query:

SELECT *, MATCH(content) AGAINST ('"foo bar"' IN BOOLEAN MODE) AS score
FROM test
WHERE MATCH(content) AGAINST ('"foo bar"' IN BOOLEAN MODE)

The problem is the results:

+----+-------------------------------------------------------------------------------+--------------------+
| id | content                                                                       | score              |
+----+-------------------------------------------------------------------------------+--------------------+
|  4 | process foo bar potato integer text bar bar content foo foo value bar foo foo |  0.948742687702179 |
|  5 | foo bar demo string value foo bar music foo bar most foo bar                  | 0.8314893841743469 |
+----+-------------------------------------------------------------------------------+--------------------+

As you can see, row 5 contains the phrase "foo bar" four times, whereas row 4 contains it only once, yet row 4 is ranked higher than row 5. It looks as though the ranking is ignoring the phrase requirement.

Does anyone know how to make this work properly?


Solution

  • I suspect this is a quirk of how phrase searches are scored. The following passage from the documentation is among the most inconsistent wording that I've seen:

    A phrase that is enclosed within double quote (") characters matches only rows that contain the phrase literally, as it was typed. The full-text engine splits the phrase into words and performs a search in the FULLTEXT index for the words. Nonword characters need not be matched exactly: Phrase searching requires only that matches contain exactly the same words as the phrase and in the same order.

    The first sentence directly conflicts with the rest of the explanation. I've highlighted the part that I think is important.

    So, I'll speculate. MySQL does the phrase matching at the word level, so additional occurrences of "foo" and "bar" in a row add to the score even when they do not form the phrase "foo bar". There is some extra mechanism that makes sure the phrase itself actually appears in each returned row, but it does not affect the score.

    One thing you could do is your own ordering:

    order by length(content) - length(replace(content, 'foo bar', '')) desc
    

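    This explicitly looks for the literal string "foo bar" in the content. Each occurrence removed by replace() shortens the string by 7 characters, so the length difference is proportional to the number of occurrences, and ordering by it puts the rows with the most occurrences of the phrase first.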
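
    Putting this together with the query from the question, a minimal sketch might look like the following. It assumes the table test and column content from the question; the alias phrase_count is a name made up for illustration. The full-text index is still used to find the matching rows, and only the ordering is replaced by an explicit count.

    ```sql
    -- Sketch only: assumes the `test` table and `content` column from the question.
    -- `phrase_count` is an illustrative alias, not anything MySQL provides.
    SELECT id,
           content,
           -- Removing every literal 'foo bar' shortens the string by
           -- CHAR_LENGTH('foo bar') characters per occurrence, so the
           -- integer quotient is the number of occurrences.
           (CHAR_LENGTH(content) - CHAR_LENGTH(REPLACE(content, 'foo bar', '')))
               DIV CHAR_LENGTH('foo bar') AS phrase_count
    FROM test
    -- The phrase search still selects the candidate rows via the FULLTEXT index.
    WHERE MATCH(content) AGAINST ('"foo bar"' IN BOOLEAN MODE)
    ORDER BY phrase_count DESC;
    ```

    Some caveats with this approach: REPLACE() matches case-sensitively and works on raw substrings, so it would miss "Foo Bar" and would count "foo barn". It also expects exactly one space between the words, whereas the full-text phrase match ignores nonword characters. If that matters for your data, the count would need to be computed differently.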