Search code examples
phpmysqlsphinxkeyword

Generating ranked keywords (based on frequency) from mysql or sphinx


I just came across the indexer --buildstopwords dict.txt 100000 .... feature in sphinx search which creates a stopwords file containing the most frequently used terms in a specified index. Here's a sample of the output from one of my indexes:

bedroom 974428
sale 888884
for 884699
house 515221
delivery 464002
day 462239
same 406905
5dr 364648

However, rather than doing it for the entire database I'd like to generate lists for individual categories. Something that does not seem possible with the sphinx indexer.

Is there any other way that I can generate a keywords list from a title field in mysql and have the results returned as an ordered list ranked popularity?

One possible issue with any answer is that my title table in mysql is not indexed (as sphinx takes care of my searching) and i have over 10m records.


Solution

  • Well as such 'all' sphinx does with the buildstops function, is run the indexing query 'sql_query', and the split the results into words (as per is normal tokenizing rules) and culculates the freqeuencs.

    As such it doesnt matter if indexed or not, you just retriving all rows.

    ... such a function would be relatively easy to replicate yourself. Run a query to get all the titles, split into words, and sum the results.


    But possibly you might stuggle to make it efficident as sphinx (because thats a compiled C program with generally pretty smart developers ;). But if you can run it as an overnight batch process, might not matter how slow and clunky your version is.


    If efficiency is important, could create a fake sphinx index (ie it never gets indexed) that could be used,

    eg create a special table with one row - category_id. and the index uses that to filter the results (A table is just a hacky way to pass a 'variable' to sphinx index)

    sql_query = SELECT id, title FROM table INNER JOIN sphinx_category USING (category_id)
    

    then a small script

    <?php 
    $cats = getCol("SELECT category_id FROM categories");
    foreach ($cats as $cat) {
      query("UPDATE sphinx_category SET category_id = $cat");
      `indexer --buildstops filtered_index > words$cat.txt`;
    }