Search code examples
sphinxthinking-sphinx

Thinking Sphinx indexing performance


I have a large index definition that takes too long to index. I suspect the main problem is caused by the many LEFT OUTER JOINs generated.

I saw this question, but can't find documentation about using source: :query, which seems to be part of the solution.

My index definition and the resulting query can be found here: https://gist.github.com/jonsgold/fdd7660bf8bc98897612

How can I optimize the generated query to run faster during indexing?


Solution

  • This is the solution that worked best (from the linked question). Basically, you can remove a piece of the main query sql_query and define it separately as a sql_joined_field in the sphinx.conf file.

    It's important to add all relevant sql conditions to each sql_joined_field (such as sharding indexes by modulo on the ID). Here's the new definition:

    ThinkingSphinx::Index.define(
      :incident,
      with: :active_record,
      delta?: false,
      delta_processor: ThinkingSphinx::Deltas.processor_for(ThinkingSphinx::Deltas::ResqueDelta)
    ) do
        indexes "SELECT incidents.id * 51 + 7 AS id, sites.name AS site FROM incidents LEFT OUTER JOIN sites ON sites.id = site_id WHERE incidents.deleted = 0 AND EXISTS (SELECT id FROM accounts WHERE accounts.status = 'enabled' AND incidents.account_id = id) ORDER BY id", as: :site, source: :query
        ...
        has
        ...
    end
    
    ThinkingSphinx::Index.define(
      :incident,
      with: :active_record,
      delta?: true,
      delta_processor: ThinkingSphinx::Deltas.processor_for(ThinkingSphinx::Deltas::ResqueDelta)
    ) do
        indexes "SELECT incidents.id * 51 + 7 AS id, sites.name AS site FROM incidents LEFT OUTER JOIN sites ON sites.id = site_id WHERE incidents.deleted = 0 AND incidents.delta = 1 AND EXISTS (SELECT id FROM accounts WHERE accounts.status = 'enabled' AND incidents.account_id = id) ORDER BY id", as: :site, source: :query
        ...
        has
        ...
    end
    

    The magic that defines the field site as a separate query is the option source: :query at the end of the line.

    Notice the core index definition has the parameter delta?: false, while the delta index definition has the parameter delta?: true. That's so I could use the condition WHERE incidents.delta = 1 in the delta index and filter out irrelevant records.

    I found sharding didn't perform any better, so I reverted to one unified index.

    See the whole index definition here: https://gist.github.com/jonsgold/05e2aea640320ee9d8b2.

    Important to remember!

    The Sphinx document ID offset must be handled manually. That is, whenever an index for another model is added or removed, my calculated document ID will change. This must be updated.

    So, in my example, if I added an index for a different model (not :incident), I would have to run rake ts:configure to find out my new offset and change incidents.id * 51 + 7 accordingly.