Search code examples
mysqlfull-text-searchinnodbmyisammysql-5.6

FullText Search Innodb Fails, MyIsam Returns Results


I've upgraded a table from myisam to innodb but am not having the same performance. The innodb returns a 0 score when there should be some relation. The myisam table returns a match for the same term (I kept a copy of the old table so I can still run the same query).

SELECT MATCH (COLUMNS) AGAINST ('+"Term Ex"' IN BOOLEAN MODE) as score
FROM table_myisam
where id = 1;

Returns:

+-------+
| score |
+-------+
|     1 |
+-------+

but:

SELECT MATCH (COLUMNS) AGAINST ('+"Term Ex"' IN BOOLEAN MODE) as score
FROM table
where id = 1;

returns:

+-------+
| score |
+-------+
|     0 |
+-------+

I thought the ex might not have been indexed because innodb_ft_min_token_size was set to 3. I lowered that to 1 and optimized the table but that had no affect. The column contents are 99 characters long so I presumed the whole column wasn't indexed because of innodb_ft_max_token_size. I increased that as well to 150 and ran the optimize again but again had the same result.

The only difference between these tables is the engine and the character set. This table is using utf8, the myisam table is using latin1.

Has anyone seen these behavior, or have advice for how to resolve it?

UPDATE: I added ft_stopword_file="" to my my.cnf and ran OPTIMIZE TABLE table again. This time I got

optimize | note | Table does not support optimize, doing recreate + analyze instead

The query worked after this change. Ex is not a stop word though so not sure why it would make a difference.

A new query that fails though is:

SELECT MATCH (Columns) AGAINST ('+Term +Ex +in' IN BOOLEAN MODE) as score FROM Table where id = 1;

+-------+
| score |
+-------+
|     0 |
+-------+

the in causes this to fail but that is the next word in my table.

SELECT MATCH (Columns) AGAINST ('+Term +Ex' IN BOOLEAN MODE) as score FROM Table where id = 1;

+--------------------+
| score              |
+--------------------+
| 219.30206298828125 |
+--------------------+

I also tried CREATE TABLE my_stopwords(value VARCHAR(30)) ENGINE = INNODB;, then updated my.cnf with innodb_ft_server_stopword_table='db/my_stopwords'. I restarted and ran:

show variables like 'innodb_ft_server_stopword_table';

which brought back:

+---------------------------------+---------------------------+
| Variable_name                   | Value                     |
+---------------------------------+---------------------------+
| innodb_ft_server_stopword_table | 'db/my_stopwords'; |
+---------------------------------+---------------------------+

so I thought the in would not cause the query to fail now but it continues. I also tried OPTIMIZE TABLE table again and even ALTER TABLE table DROP INDEX ... and ALTER TABLE table ADD FULLTEXT KEY ... none of which have had an affect.

Second Update The issue is with the stop words.

$userinput = preg_replace('/\b(a|about|an|are|as|at|be|by|com|de|en|for|from|how|i|in|is|it|la|of|on|or|that|the|this|to|was|what|when|where|who|will|with|und|the|www)\b/', '', $userinput);

resolves the issue but that doesn't appear as a good solution to me. I'd like a solution that avoids the stop words breaking this in mysql.

Stopword table data:

CREATE TABLE `my_stopwords` (
  `value` varchar(30) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1

and

Name: my_stopwords
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 0
 Avg_row_length: 0
    Data_length: 16384
Max_data_length: 0
   Index_length: 0
      Data_free: 0
 Auto_increment: NULL
    Create_time: 2019-04-09 17:39:55
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options: 
        Comment: 

Solution

  • There are several differences between MyISAM's FULLTEXT and InnoDB's. I think you were caught by the handling of 'short' words and/or stop words. MyISAM will show rows, but InnoDB will fail to.

    What I have done when using FT (and after switching to InnoDB) is to filter the user's input to avoid short words. It takes extra effort but gets me the rows desired. My case is slightly different since the resulting query is something like this. Note that I have added + to require the words, but not on words shorter than 3 (my ft_min_token_size is 3). These searches were for build a table and build the table:

    WHERE match(description) AGAINST('+build* a +table*' IN BOOLEAN MODE)
    WHERE match(description) AGAINST('+build* +the* +table*' IN BOOLEAN MODE)
    

    (The trailing * may be redundant; I have not investigated that.)

    Another approach

    Since FT is very efficient at non-short, non-stop words, do the search with two phases, each being optional: To search for "a long word", do

    WHERE MATCH(d) AGAINST ('+long +word' IN BOOLEAN MODE)
      AND d REGEXP '[[:<:]]a[[:>:]]'
    

    The first part whittles down the possible rows rapidly by looking for 'long' and 'word' (as words). The second part makes sure there is a word a in the string, too. The REGEXP is costly but will be applied only to those rows that pass the first test.

    To search just for "long word":

    WHERE MATCH(d) AGAINST ('+long +word' IN BOOLEAN MODE)
    

    To search just for the word "a":

    WHERE d REGEXP '[[:<:]]a[[:>:]]'
    

    Caveat: This case will be slow.

    Note: My examples allow for the words to be in any order, and in any location in the string. That is, this string will match in all my examples: "She was longing for a word from him."