Search code examples
sqlmysql

MySQL Query performance - best partial match


I have performance issue on a query, potentiall wrongly written.

Dear all. I have two tables. One table "test_cdr" , with rougly 5M rows, contains three columns

  • POINT_TARGET | Contains called phone numbers
  • dest | should contain a destination identifier
  • DURATION | the duration of the call

Another table "codes" contains two columns

  • area_codes | contains a list of country prefixes
  • dest | contains the destination identifier

Now I need to puplate the test_cdr.dest by:

  • run every number in test_cdr.POINT_TARGET against the codes.area_codes
  • identify the best matches codes.area_codes
  • take the data inside codes.dest and copy it into test_cdr

like this, I can then sum the duration of each call per destination type.

I am, atm, using this query

UPDATE test_cdr AS main
JOIN (
    SELECT c.POINT_TARGET, t.dest
    FROM test_cdr c
    JOIN codes t ON c.POINT_TARGET LIKE CONCAT(t.area_codes, '%')
    WHERE c.dest IS NULL  -- Only update rows where dest is NULL
    LIMIT 1000  -- Adjust the batch size as needed
) AS matched ON main.POINT_TARGET = matched.POINT_TARGET
SET main.dest = matched.dest;

It only works with LIMIT and it returns roughly 10000 rows (every 20 seconds - pretty lame performance) As soon as I remove LIMIT, the query runs but nothing is performed on the db (I see no disk activity, while with LIMIT there is disk activity).

Could you help me figure out the issue or maybe provide a better option? Btw, I have used INDEXES.

Thank you.


Solution

  • I'm not sure if this query improves the performance.
    Furthermore, it updates all CDRs having destination NULL and maybe that's not what you want.
    It does however take the longest area code in case of multiple matches.
    For instance, if the dialed number is +3906123456 it considers area code +3906 (Rome) instead of +390 (Italy fixed).
    Anyway, give it a try and please be gentle in case it's not useful.

    UPDATE test_cdr AS cdrs
       SET cdrs.dest = (SELECT cod1.dest
                        FROM   codes AS cod1
                        WHERE  cod1.area_codes = LEFT(cdrs.point_target, LENGTH(cod1.area_codes))
                          AND  NOT EXISTS (SELECT *
                                           FROM   codes AS cod2
                                           WHERE  cod2.area_codes = LEFT(cdrs.point_target, LENGTH(cod2.area_codes))
                                             AND  LENGTH(cod2.area_codes) > LENGTH(cod1.area_codes))
                        LIMIT 1)
     WHERE cdrs.dest IS NULL;
    

    I'm quite convinced that

    t.area_codes = LEFT(c.point_target, LENGTH(t.area_codes))
    

    is faster than

    c.point_target LIKE CONCAT(t.area_codes, '%')
    

    UPDATE:
    I suppose there's a unique key on codes.area_codes. In that case you can remove LIMIT 1 from the subquery since that query will never return more than 1 value.
    Furthermore, if it doesn't find the corresponding area code, the destination will be updated to NULL which causes the same dialed number to be re-examined the next time.