Search code examples
mysqlsqlquery-performance

Generate a join similar to a vlookup based on closest date


I have the following two tables:

movie_sales (provided daily)

  • movie_id
  • date
  • revenue

movie_rank (provided every few days or weeks)

  • movie_id
  • date
  • rank

The tricky thing is that every day I have data for sales, but only data for ranks once every few days. Here is an example of sample data:

`movie_sales`
- titanic (ID), 2014-06-01 (date), 4.99 (revenue)
- titanic (ID), 2014-06-02 (date), 5.99 (revenue)

`movie_rank`
- titanic (ID), 2014-05-14 (date), 905 (rank)
- titanic (ID), 2014-07-01 (date), 927 (rank)

And, because the movie_rate.date of 2014-05-14 is closer to the two sales dates, the output should be:

id         date             revenue           closest_rank
titanic    2014-06-01       4.99               905
titanic    2014-06-02       5.99               905

The following query works to get the results by getting the min date difference in the sub-select:

SELECT
    id,
    date,
    revenue,
    (SELECT rank from movie_rank where id=s.id ORDER BY ABS(DATEDIFF(date, s.date)) ASC LIMIT 1)
FROM
    movie_sales s

But I'm afraid that this would have terrible performance as it will literally be doing millions of subselects...on millions of rows. What would be a better way to do this, or is there really no proper way to do this since an index can not be properly done with a DATEDIFF ?


Solution

  • Unfortunately, you are right. The movie rank table must be searched for each movie sale and of all matching movie rows the closest be picked.

    With an index on movie_rank(id) the DBMS finds the movie rows quickly, but an index on movie_rank(id, date) would be better, because the date could be read from the index and only the one best match would be read from the table.

    But you also say that there are new ranks every few dates. If it is guaranteed to find a rank in a certain range, e.g. for each date there will be at least one rank in the twenty days before and at least one rank in the twenty days after, you can limit the search accordingly. (The index on movie_rank(id, date) would be essential for this, though.)

    SELECT
      id,
      date,
      revenue,
      (
        select r.rank 
        from movie_rank r
        where r.id = s.id
        and r.date between s.date - interval 20 days
                       and s.date + interval 20 days
        order by abs(datediff(date, s.date)) asc
        limit 1
      )
    FROM movie_sales s;