Search code examples
mysqlsqlauto-responder

Optimize SQL: Customers that haven't ordered for x days


I have created this SQL in order to find customers that haven't ordered for X days.

It is returning a result set, so this post is mainly just to get a second opinion on it, and possible optimizations.

SELECT o.order_id,
       o.order_status,
       o.order_created,
       o.user_id,
       i.identity_firstname,
       i.identity_email,

  (SELECT COUNT(*)
   FROM orders o2
   WHERE o2.user_id=o.user_id
     AND o2.order_status=1) AS order_count,

  (SELECT o4.order_created
   FROM orders o4
   WHERE o4.user_id=o.user_id
     AND o4.order_status=1
   ORDER BY o4.order_created DESC LIMIT 1) AS last_order
FROM orders o
INNER JOIN user_identities ui ON o.user_id=ui.user_id
INNER JOIN identities i ON ui.identity_id=i.identity_id
   AND i.identity_email!=''
INNER JOIN subscribers s ON i.identity_id=s.identity_id
  AND s.subscriber_status=1
  AND s.subsriber_type=e
  AND s.subscription_id=1
WHERE DATE(o.order_created) = "2013-12-14"
  AND o.order_status=1
  AND o.user_id NOT IN
    (SELECT o3.user_id
     FROM orders o3
     WHERE o3.user_id=o.user_id
       AND o3.order_status=1
       AND DATE(o3.order_created) > "2013-12-14")

Can you guys find any potential problems with this SQL? Dates are dynamically inserted.

The final SQL that I put in production, will basically only include o.order_id, i.identity_id and o.order_count - this order_count will need to be correct. The other selected fields and 'last_order' subquery will not be included, it's only for testing.

This should give me a list of users that have their last order on that particular day, and is a newsletter subscriber. I am particular in doubt about correctness of the NOT IN part in the WHERE clause, and the order_count subquery.


Solution

  • There are several problems:

    A. Using functions on indexable columns

    You are searching for orders by comparing DATE(order_created) with some constant. This is a terrible idea, because a) the DATE() function is executed for every row (CPU) and b) the database can't use an index on the column (assuming one existed)

    B. Using WHERE ID NOT IN (...)

    Using a NOT IN (...) is almost always a bad idea, because optimizers usually have trouble with this construct, and often get the plan wrong. You can almost always express it as an outer join with a WHERE condition that filters for misses using an IS NULL condition for a joined column (and adds the side benefit of not needing DISTINCT, because there's only ever one miss returned)

    C. Leaving joins that filtering out of large portions of rows too late

    The earlier you can mask off rows by not making joins the better. You can do this by joining less likely to match tables earlier in the joined table list, and by putting non-key conditions into join rather than the where clause to get the rows excluded as early as possible. Some optimizers to this anyway, but I've often found they don't

    D. Avoid correlated subqueries like the plague!

    You have several correlated subqueries - ones that are executed for every row of the main table. That's really an incredibly bad idea. Again sometimes the optimizer can craft them into a join, but why rely (hope) on that. Most correlated subqueries can be expressed as a join; you examples are no exception.

    With the above in mind, there are some specific changes:

    • o2 and o4 are the same join, so o4 may be dispensed with entirely - just use o2 after conversion to a join
    • DATE(order_created) = "2013-12-14" should be written as order_created between "2013-12-14 00:00:00" and "2013-12-14 23:59:59"

    This query should be what you want:

    SELECT
        o.order_id,
        o.order_status,
        o.order_created,
        o.user_id,
        i.identity_firstname,
        i.identity_email,
        count(o2.user_id) AS order_count,
        max(o2.order_created) AS last_order
    FROM orders o
    LEFT JOIN orders o2 ON o2.user_id = o.user_id AND o2.order_status=1
    LEFT JOIN orders o3 ON o3.user_id = o.user_id 
        AND o3.order_status=1
        AND o3.order_created >= "2013-12-15 00:00:00"
    JOIN user_identities ui ON o.user_id=ui.user_id
    JOIN identities i ON ui.identity_id=i.identity_id AND i.identity_email != ''
    JOIN subscribers s ON i.identity_id=s.identity_id
      AND s.subscriber_status=1
      AND s.subsriber_type=e
      AND s.subscription_id=1
    WHERE o.order_created between "2013-12-14 00:00:00" and "2013-12-14 23:59:59"
    AND o.order_status=1
    AND o3.order_created IS NULL -- This gets only missed joins on o3
    GROUP BY
        o.order_id,
        o.order_status,
        o.order_created,
        o.user_id,
        i.identity_firstname,
        i.identity_email;
    

    The last line is how you achieve the same as NOT IN (...) using a LEFT JOIN

    Disclaimer: Not tested.