Wrong search results with Sphinx - Ruby on Rails


We ask for your help because we are really stuck :-(

We made a big upgrade to one of our products that uses Sphinx search.

Search always worked fine before, but since the upgrade the results are absolutely wrong, and after many days of downgrading and other attempts we have not been able to solve it.

A search with or without accents should return hundreds of results, but now returns only a few. The results are totally wrong: accented characters seem to be replaced by nothing, as if charset_table were ignored.

In order to obtain good results for "hopital" or "hôpital" we have to type "hpital"...

Of course we use a charset_table, have reindexed all tables, use UTF-8, etc.

Before, we had a working search with:

  • Ruby on Rails 1.9.3
  • Sphinx 2.0.10
  • Riddle 1.5.12
  • Thinking Sphinx 3.1.4
  • MySQL 5.5.52

Our broken config is:

  • Ruby on Rails 2.0.0
  • Sphinx 2.2.11
  • Riddle 2.0.0
  • Thinking Sphinx 3.1.4
  • MySQL 5.5.52

Thanks in advance for all your feedback


Solution

  • Not sure I know enough to suggest how to fix it, but I might be able to explain it.

    Sphinx now has a rewritten tokenizer that responds differently to invalid UTF-8 sequences.

    Previously, invalid sequences would just become 'separators', so it's entirely possible the search worked because "hôpital" would simply be indexed as "h pital"; the query would be treated the same way, and so would 'match'.

    But the new tokenizer 'drops' invalid sequences, so if "hôpital" is received 'mangled' in some way it gets indexed as "hpital": the invalid bytes are gone.

    (The query parser hasn't changed, so it now behaves inconsistently with text parsing.)

    So if indexing of UTF-8 data is somehow not completely correct, the behaviour will have changed. It just wasn't noticed before because it was consistently wrong :)

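    So perhaps making sure Sphinx receives correctly encoded data from the database MAY fix it? Something like running SET NAMES on the connection before the indexing query (Sphinx's sql_query_pre directive is the usual place for that). If the data arrives at Sphinx as valid UTF-8, then it should index OK, as per charset_table.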
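
To make the explanation above concrete, here is a small Ruby sketch that only imitates the two behaviours described (invalid bytes turned into separators versus invalid bytes dropped). It is not Sphinx's tokenizer, and every name in it (as_utf8, old_style_tokens, new_style_tokens, valid_utf8?) is hypothetical. It assumes the "mangled" case is Latin-1 bytes being read as if they were UTF-8, which is one plausible cause but not confirmed by the question.

```ruby
# Illustrative only: plain Ruby that mimics the behaviours described above.
# This is NOT Sphinx code; all method names are made up for the example.

# "hôpital" as Latin-1 bytes: ô is the single byte 0xF4, which is not a
# valid UTF-8 sequence when followed by "p".
LATIN1_BYTES = "h\xF4pital".b

# The same word as proper UTF-8 bytes (ô is 0xC3 0xB4).
UTF8_BYTES = "h\u00F4pital".b

# Label raw bytes as UTF-8 without converting them (the "mangled" case).
def as_utf8(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8)
end

# Old behaviour as described: an invalid sequence acts like a separator.
def old_style_tokens(bytes)
  as_utf8(bytes).scrub(" ").split
end

# New behaviour as described: an invalid sequence is silently dropped.
def new_style_tokens(bytes)
  as_utf8(bytes).scrub("").split
end

# Quick check for whether a byte string is well-formed UTF-8.
def valid_utf8?(bytes)
  as_utf8(bytes).valid_encoding?
end

p old_style_tokens(LATIN1_BYTES) # => ["h", "pital"]
p new_style_tokens(LATIN1_BYTES) # => ["hpital"]
p valid_utf8?(LATIN1_BYTES)      # => false
p valid_utf8?(UTF8_BYTES)        # => true
```

If text pulled from the database fails a check like valid_utf8?, the connection character set is the first thing to look at, which is where SET NAMES comes in. With valid UTF-8 input both sketches return the whole word as one token, so charset_table would get a real "ô" to map.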