I'm trying to do sort of a matching query to match recording from table A to table B using multiple columns. But for some reason the query is running very slow. I tried to experiment with different index combinations for the joined table but they are not being used, which is why it's always doing a full table scan.
SELECT
nc.id, nc.firstName, nc.lastName,
nc.firmName, nc.location,
nc.city AS city, nc.state AS state,
ac.id, ac.lastName, ac.firstName, ac.middleName,
IFNULL(ac.suffixName, '') AS suffixName, ac.firmName, ac.city, ac.state
FROM
NormalContacts AS nc
JOIN
AllContacts AS ac ON ((nc.firstName = ac.firstName AND nc.lastName = fa.lastName) OR (nc.firstName = fa.middleName AND nc.lastName = ac.lastName))
AND (ac.city = nc.city AND ac.state = nc.state)
JOIN
FirmInputTable AS fit ON (fit.firmName = fa.firmName AND fit.otherFirmName = nc.firmName)
WHERE
nc.crdNumber IS NULL AND nc.city IS NOT NULL AND nc.state IS NOT NULL AND nc.firmName IS NOT NULL
NormalContacts
is 1,000 records while AllContacts
is 337,250 records. All fields in the JOIN
condition are varchar.
Here is the EXPLAIN
result:
*fa in the screenshot AllContacts
, sorry for the typo.
Hope anyone can suggest a way to optimize this query or tell me what I'm doing wrong. Thanks in advance!
The Optimizer will look at indexes and statistics to decide what order to do the the tables.
In the following, I will assume that ON
says only how the tables relate to each other and WHERE
is only for filtering.
The "first" table will probably (but not necessarily) be the one with the most selective WHERE
condition(s). So, it's INDEX
needs to focus on the columns in WHERE
. (Sometimes GROUP BY
or ORDER BY
come into play.)
The other tables in the JOIN
sequence will be reached into via "Nested Loop Join". This is where "for each row in the previous table(s), fetch the row(s) in the current table". To do this fetch it will use any WHERE
clauses relevant to the table and any thing in the ON
that mentions this table (and probably previous tables). So, think about the INDEX
for this table from the WHERE
and ON
conditions.
Since you cannot always predict what order the Optimizer will walk through the tables, it is best to add index(es) to accommodate each possible order.
A typical shortcut is when WHERE
mentions only one table. That table is almost certain to be picked first.
How to build a good index for a particular table, given the WHERE
/ON
things leading into it? See my cookbook
For your case...
You seem to have properly split the ONs
and WHEREs
-- Good.
nc
seems to be the only table mentioned in WHERE
, so we can probably assume that the Optimizer will start with it.
IS NULL
is similar to = constant
, but IS NOT NULL
is like a range, not not as easy to optimize. I recommend
nc: INDEX(crdNumber, state)
(Here, I am guessing at which nc
column is most/least likely to be NULL
.)
After nc
, only ac
(aka fa
???) could come:
ON ((...) OR (...))
AND ac.city = nc...
AND ac.state = nc...
OR
is usually not possible to index or optimize, so we are left with
ac: INDEX(city, state) -- in either order
Possibly lastName
could be pulled out (after fixing ac/fa) of the OR
, thereby leading to
ac: INDEX(city, state, firstName) -- in any order
Finally, fit
:
ON fit.firmName = ...
AND fit.otherFirmName = ...
leads to
fit: INDEX(firmName, otherFirmName) -- in either order
Please fix the fa
in the query; then I will modify my answer if needed.
Note: INDEX(a,b)
is better than INDEX(a), INDEX(b)
in cases like these.