Configure Sphinx to treat a space as a possible word separator


Suppose I have the text Foo Bar Baz-Qux. How can I configure Sphinx's indexer so that Sphinx can find a match for any of the following strings?

Foo Bar Baz-Qux
Foo BazQux Bar
Baz Qux Foo Bar

Currently I have a dash as the value of the ignore_chars setting, and Sphinx returns a result for the first two queries but not for the third.

Please note that the solution must be general and must not rely on the particular words from the example or on their relative order.

Thanks!


Solution

  • I have found a solution (or at least a workaround): use regexp_filter.

    The Sphinx index config now looks like this:

    ...
    ignore_chars = -
    regexp_filter = \b([\w\d]+)-([\w\d]+)\b => \1\2 \1 \2
    ...
    

    So right before Sphinx puts the text into its index, it rewrites every dash-containing word into two forms: one where the dash is simply removed and one where the dash is replaced with a space. At index creation time, the text "Foo-Bar" is indexed as three words: "FooBar", "Foo" and "Bar". This lets me search with any of the following queries: "Foo-Bar" (the dash is removed because it is in the ignore_chars list), "FooBar" (this word is in the index) and "Foo Bar" (both words are in the index).

    The main problem here is that you cannot use exact phrase matching for both types of queries at the same time. That is, if you search for "Bar BazQux" or "Bar Baz-Qux" you will get a result, but for "Bar Baz Qux" you will get nothing, because in the index "Bar" is followed by "BazQux", not by "Baz". In my specific case this is not an issue, but anyone who wants to use this approach should be aware of it.

    If you know a better way to do this, or if this workaround has disadvantages that I have missed, please let me know.


    Another possible solution is to use trigrams, as shown here. This approach also helps with possible user typos, but it is more difficult to implement.
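
To see what the regexp_filter rule above does to the indexed text, here is a rough Python simulation. It is only a sketch under several assumptions: Sphinx applies regexp_filter with the RE2 engine, while this uses Python's re module, which is assumed to behave the same for this particular pattern. The helpers tokenize, index_tokens, query_tokens, matches_all and matches_phrase are hypothetical stand-ins, not Sphinx APIs, and queries are assumed to pass only through ignore_chars, as described in the answer.

```python
import re

# Python approximation of the regexp_filter rule from the config above.
# Assumption: Python's `re` treats this pattern like Sphinx's RE2 engine does.
DASH_RULE = re.compile(r"\b([\w\d]+)-([\w\d]+)\b")


def apply_filter(text):
    """Rewrite 'Baz-Qux' as 'BazQux Baz Qux', mirroring '=> \\1\\2 \\1 \\2'."""
    return DASH_RULE.sub(r"\1\2 \1 \2", text)


def tokenize(text, ignore_chars="-"):
    """Hypothetical stand-in for Sphinx tokenization.

    Drops every character listed in ignore_chars, lowercases the text,
    and splits it on anything that is not a word character.
    """
    for ch in ignore_chars:
        text = text.replace(ch, "")
    return re.findall(r"\w+", text.lower())


def index_tokens(document):
    """Tokens stored in the index: the filter runs first, then tokenization."""
    return tokenize(apply_filter(document))


def query_tokens(query):
    """Tokens of a query; assumed to be processed by ignore_chars only."""
    return tokenize(query)


def matches_all(document, query):
    """True if every query word occurs somewhere in the indexed document."""
    indexed = set(index_tokens(document))
    return all(word in indexed for word in query_tokens(query))


def matches_phrase(document, query):
    """True if the query words occur as one contiguous run in the index."""
    doc = index_tokens(document)
    words = query_tokens(query)
    n = len(words)
    return any(doc[i:i + n] == words for i in range(len(doc) - n + 1))


if __name__ == "__main__":
    text = "Foo Bar Baz-Qux"
    print(apply_filter(text))
    for q in ("Foo Bar Baz-Qux", "Foo BazQux Bar", "Baz Qux Foo Bar"):
        print(q, "->", matches_all(text, q))
    for q in ("Bar BazQux", "Bar Baz-Qux", "Bar Baz Qux"):
        print("phrase", q, "->", matches_phrase(text, q))
```

In this model all three queries from the question match as plain word queries, while the phrase query "Bar Baz Qux" fails because the indexed word order is foo, bar, bazqux, baz, qux. That is the same limitation the answer warns about.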