Searching on multiple fulltext fields with Sphinx


I am trying to use Sphinx to search on multiple fulltext fields, essentially to get around the restriction that attributes used for search filtering must be numeric (the database uses a lot of alphanumeric uniqIDs as ids instead).

Here's the main query used in the Sphinx config:

sql_query               = \
        SELECT text_page.id, text_page.document_id, documents.startdate, documents.enddate, documents.long_title, documents.volume,text_page.images_page_id, text_page.text, \
        series.name, series.id AS series_id, series.white_label_id AS white_label_id, \
        documents.date_created\
        FROM text_page \
        INNER JOIN documents ON text_page.document_id = documents.id \
        INNER JOIN series ON documents.series_id = series.id

text_page.text is the main fulltext field.

I have added this line to the config to try to get this column fulltext indexed as well:

sql_field_string = white_label_id

I then tried to create a query narrowed by white_label_id by running the following query through the PHP Sphinx class.

"@text (search words) @white_label_id (some-uniq-id)"

As I understand the extended query syntax, this should mean that both @text and @white_label_id have to match within the same indexed row for it to be returned.

However, the query never produces any results, and there are no errors or warnings.

Any suggestions as to what is going wrong here? Is it because the white_label_id and text fields come from different tables? Is there a solution that avoids restructuring the database to use numeric IDs?

Edited:

As requested, here is the full config file. Note that at present the code is still using the PHP Sphinx class, rather than SphinxQL via mysqli.

source src2
{

    sql_host                = localhost
    sql_user                = username
    sql_pass                = password
    sql_db                  = databasename
    sql_port                = 3306  # optional, default is 3306 
    sql_query_pre           = SET NAMES utf8

    sql_query               = \
        SELECT text_page.id, text_page.document_id, documents.startdate, documents.enddate, documents.long_title, documents.volume,text_page.images_page_id, text_page.text, \
        series.name, series.id AS series_id, series.white_label_id AS white_label_id, \
        documents.date_created\
        FROM text_page \
        INNER JOIN documents ON text_page.document_id = documents.id \
        INNER JOIN series ON documents.series_id = series.id

    
    sql_attr_uint                   = startdate
    sql_attr_uint                   = enddate
    sql_attr_uint                   = volume

    sql_attr_timestamp      = date_created


    sql_attr_string     = long_title
    sql_attr_string     = name
    #sql_attr_string        = white_label_id #NB - does not work with nonnumeric ids
    sql_attr_string     = document_id
    sql_attr_string     = series_id

    sql_field_string = white_label_id  #currently appears to do nothing

    
    sql_ranged_throttle = 0 
}

source src2throttled : src2
{
    sql_ranged_throttle         = 100
}

index myindex1
{
    
    source          = src2
    path            = /var/data/mydata1
    docinfo         = extern
    mlock           = 0
    morphology      = none
    min_word_len        = 1
    charset_type        = utf-8
    html_strip              = 0

}

index myindex1stemmed : myindex1
{
    path            = /var/data/mydata1stemmed
    morphology      = stem_en
    index_exact_words   = 1
}

Solution

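  • It eventually turned out that there is a much better way to work around the 'numeric only' rule on Sphinx attributes.

    The answer is to create a numeric hash of each text-based uniq_id column (here with MySQL's CRC32()), which can then be declared as sql_attr_uint, in place of the sql_field_string line, and used to narrow searches. The filter value passed at search time has to be the same hash of the ID you are looking for. Note that CRC32 is only 32 bits wide, so two different IDs can occasionally hash to the same value.

    For example, the SQL query in the original post becomes: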

    sql_query               = \
            SELECT text_page.id, text_page.document_id, documents.startdate, documents.enddate, documents.long_title, documents.volume,text_page.images_page_id, text_page.text, \
            series.name, series.id AS series_id, CRC32(series.white_label_id) AS white_label_id, \
            documents.date_created\
            FROM text_page \
            INNER JOIN documents ON (text_page.document_id = documents.id AND documents.is_active = 1) \
            INNER JOIN series ON documents.series_id = series.id
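
On the client side, the ID you want to filter on has to be hashed the same way before it is handed to Sphinx. A minimal sketch in Python, assuming the IDs are stored as UTF-8 and that MySQL's CRC32() is the standard unsigned CRC-32 that zlib also implements; the helper name sphinx_hash is made up for illustration and is not part of any Sphinx API:

```python
import zlib


def sphinx_hash(uniq_id: str) -> int:
    """Hash a text ID to an unsigned 32-bit integer.

    Assumption: this mirrors CRC32(series.white_label_id) in the sql_query,
    which holds only if the column bytes are the UTF-8 encoding of the ID.
    """
    # Mask to 32 bits so the result is always non-negative, like a uint attribute.
    return zlib.crc32(uniq_id.encode("utf-8")) & 0xFFFFFFFF


# Value to pass as the white_label_id filter instead of the raw text ID.
filter_value = sphinx_hash("some-uniq-id")
print(filter_value)
```

With the PHP Sphinx class the equivalent would presumably be something like $cl->SetFilter('white_label_id', array(crc32($id))), keeping the fulltext part of the query for text only. On 32-bit PHP builds crc32() can return a negative number, so the value may need converting to its unsigned form first.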