Table a
has around 8,000 rows and table b
has around 250,000 rows. Without the levenshtein
function the query takes just under 2 seconds. With the function included it is taking about 25 minutes.
SELECT
*
FROM
library a,
classifications b
WHERE
a.`release_year` = b.`year`
AND a.`id` IS NULL
AND levenshtein_ratio(a.title, b.title) > 82
I'm assuming that levenshtein_ratio
is a function that you wrote (or maybe included from somewhere else). If so, the database server would not be able to optimize that in the normal sense of using an index. So it means that it simply needs to call it for each record that results from the other join conditions. With an inner join, that could be an extremely large number with those table sizes (a maximum of 8000*250000 = 2 billion). You can check the total number of times it would need to be called with this:
SELECT
count(*)
FROM
library a,
classifications b
WHERE
a.`release_year` = b.`year`
AND a.`id` IS NULL
That is an explanation of why it is slow (not really an answer to the question of how to optimize it). To optimize it, you likely need to add additional limiting factors to the join condition to reduce the number of calls to the user-defined function.