Search code examples
mysqlinnodb

Index for a table with multiple JOIN varchar conditions


I'm trying to do sort of a matching query to match recording from table A to table B using multiple columns. But for some reason the query is running very slow. I tried to experiment with different index combinations for the joined table but they are not being used, which is why it's always doing a full table scan.

SELECT
    nc.id, nc.firstName, nc.lastName, 
    nc.firmName, nc.location,
    nc.city AS city, nc.state AS state, 
    ac.id, ac.lastName, ac.firstName, ac.middleName, 
    IFNULL(ac.suffixName, '') AS suffixName, ac.firmName, ac.city, ac.state
FROM
    NormalContacts AS nc
JOIN
    AllContacts AS ac ON ((nc.firstName = ac.firstName AND nc.lastName = fa.lastName) OR (nc.firstName = fa.middleName AND nc.lastName = ac.lastName))
        AND (ac.city = nc.city AND ac.state = nc.state)
JOIN
    FirmInputTable AS fit ON (fit.firmName = fa.firmName AND fit.otherFirmName = nc.firmName)       
WHERE
    nc.crdNumber IS NULL AND nc.city IS NOT NULL AND nc.state IS NOT NULL AND nc.firmName IS NOT NULL

NormalContacts is 1,000 records while AllContacts is 337,250 records. All fields in the JOIN condition are varchar.

Here is the EXPLAIN result:

enter image description here

*fa in the screenshot AllContacts, sorry for the typo.

Hope anyone can suggest a way to optimize this query or tell me what I'm doing wrong. Thanks in advance!


Solution

  • The Optimizer will look at indexes and statistics to decide what order to do the the tables.

    In the following, I will assume that ON says only how the tables relate to each other and WHERE is only for filtering.

    The "first" table will probably (but not necessarily) be the one with the most selective WHERE condition(s). So, it's INDEX needs to focus on the columns in WHERE. (Sometimes GROUP BY or ORDER BY come into play.)

    The other tables in the JOIN sequence will be reached into via "Nested Loop Join". This is where "for each row in the previous table(s), fetch the row(s) in the current table". To do this fetch it will use any WHERE clauses relevant to the table and any thing in the ON that mentions this table (and probably previous tables). So, think about the INDEX for this table from the WHERE and ON conditions.

    Since you cannot always predict what order the Optimizer will walk through the tables, it is best to add index(es) to accommodate each possible order.

    A typical shortcut is when WHERE mentions only one table. That table is almost certain to be picked first.

    How to build a good index for a particular table, given the WHERE/ON things leading into it? See my cookbook

    For your case...

    You seem to have properly split the ONs and WHEREs -- Good.

    nc seems to be the only table mentioned in WHERE, so we can probably assume that the Optimizer will start with it.

    IS NULL is similar to = constant, but IS NOT NULL is like a range, not not as easy to optimize. I recommend

    nc:  INDEX(crdNumber, state)
    

    (Here, I am guessing at which nc column is most/least likely to be NULL.)

    After nc, only ac (aka fa???) could come:

    ON    ((...) OR (...))
      AND ac.city = nc...
      AND ac.state = nc...
    

    OR is usually not possible to index or optimize, so we are left with

    ac:  INDEX(city, state)  -- in either order
    

    Possibly lastName could be pulled out (after fixing ac/fa) of the OR, thereby leading to

    ac:  INDEX(city, state, firstName)  -- in any order
    

    Finally, fit:

    ON    fit.firmName = ...
      AND fit.otherFirmName = ...
    

    leads to

    fit:  INDEX(firmName, otherFirmName)  -- in either order
    

    Please fix the fa in the query; then I will modify my answer if needed.

    Note: INDEX(a,b) is better than INDEX(a), INDEX(b) in cases like these.