Search code examples
mysqllevenshtein-distance

How to optimize this Levenshtein distance calculation


Table a has around 8,000 rows and table b has around 250,000 rows. Without the levenshtein function the query takes just under 2 seconds. With the function included it is taking about 25 minutes.

SELECT
      *
   FROM
      library a,
      classifications b
   WHERE  
      a.`release_year` = b.`year`
      AND a.`id` IS NULL
      AND levenshtein_ratio(a.title, b.title) > 82

Solution

  • I'm assuming that levenshtein_ratio is a function that you wrote (or maybe included from somewhere else). If so, the database server would not be able to optimize that in the normal sense of using an index. So it means that it simply needs to call it for each record that results from the other join conditions. With an inner join, that could be an extremely large number with those table sizes (a maximum of 8000*250000 = 2 billion). You can check the total number of times it would need to be called with this:

    SELECT
          count(*)
       FROM
          library a,
          classifications b
       WHERE  
          a.`release_year` = b.`year`
          AND a.`id` IS NULL
    

    That is an explanation of why it is slow (not really an answer to the question of how to optimize it). To optimize it, you likely need to add additional limiting factors to the join condition to reduce the number of calls to the user-defined function.