Search code examples
phpmysqlmatchsphinx

Using a title to determin possible categories in SphinxQL


I have a database with over 60 million records indexed by SphinxQL 2.1.1. Each record has a title and a catid (among other things). When a new record is inserted into the database, I am trying to get sphinx to guess the catid based on the text in the title.

I have managed to get it working for single words like so:

SELECT @groupby, catid, count(*) c FROM sphinx WHERE MATCH('*LANDLORDS*') group by catid order by c desc

However the actual title is likely to be something like this:

Looking for Landlords - Long term lease - No fees!!!

Is there any way to just dump the whole title string into sphinx and have it break down each of the words and perform some sort of fuzzy match, returning the most likely category?


Solution

  • Well as such sphinx isnt 'magical', and it doesn't have a 'fuzzy match' function.

    But can approximate one :) Two main steps...

    1. Changing from requiring all 'words', to just requiring some,

    2. changing ranking, to try to make the best 'intersection' between the query and the title, get a high weight, and therefore 'bubble' to the top.

    Can then just take the top result, and take it be a 'best guess'.

    (there is actully a third, words lie 'for' and 'the' are likly to cause lots of false positives, so may want to exclude them, either using stopwords on the index, or just strip then from the query)

    A prototype of such a query might be something like

    SELECT catid FROM sphinx WHERE MATCH('"Looking Landlords Long term lease No fees"/1') OPTION ranker=wordcount LIMIT 1;
    

    Thats using quorum to affect matching, and choosing a different ranker.


    Using this version with grouping, proabbly wont work, as will include lots of low quality matches. Although could perhap try using avg, or sum to get a composite weight?

    SELECT SUM(WEIGHT()) as w, catid FROM sphinx WHERE MATCH('"Looking Landlords Long term lease No fees"/1') GROUP BY catid ORDER BY w DESC OPTION ranker=wordcount LIMIT 1
    

    There are lots of ways to tweak this...

    You can try other rankers, eg matchany. Or even some custom ranking expressions.

    Or change the quorum, eg rather rank requiring 1 word, could result at least a few.

    Or if can extract phrases, eg

    '"Looking Landlords" | "Long term lease" | "No fees"'

    might work?

    ALso could rather than just taking the top result, take the top 5-10 results, and show them all to the user, compenstates for the fact the results are very approximate.