mysql join sql-order-by cumulative-sum sql-limit

MySQL – JOIN automatically executes irrelevant SELECT expressions

In trying to implement a cumulative sum query with JOIN, I've run into a surprising problem: the cumulative sum expression is sometimes executed unexpectedly over the entire table, in an unexpected order, and with no adherence to LIMIT.

Consider the following table schema:

CREATE TABLE t ( 
    v INT NOT NULL, 
    q_val INT
) ENGINE=InnoDB;
CREATE TABLE q ( val INT ) ENGINE=InnoDB;

INSERT INTO t VALUES 
    (5, 1), 
    (3, 1), 
    (1, 1), 
    (4, 1), 
    (2, 1);
INSERT INTO q VALUES (1);
-- (Table `q` exists solely for the purpose of the JOIN)

^{The data in t is entered out of order on purpose.}

This is the query in question:

SELECT
  v, @cumulative:=@cumulative+v
  FROM t
  [INNER/LEFT/RIGHT] JOIN q ON t.q_val=q.val
  CROSS JOIN (
      SELECT (@cumulative:=0)
  ) z
  ORDER BY t.v DESC
  LIMIT 2;

We should expect to get the result on the left, but oftentimes we get the result on the right:

+---+----------------------------+      +---+----------------------------+
| v | @cumulative:=@cumulative+v |      | v | @cumulative:=@cumulative+v |
+---+----------------------------+      +---+----------------------------+
| 5 |                          5 |  VS  | 5 |                          5 |
| 4 |                          9 |      | 4 |                         13 |
+---+----------------------------+      +---+----------------------------+

If we remove the LIMIT clause on the queries that exhibit unexpected behaviour, we see what is really happening:

+---+----------------------------+
| v | @cumulative:=@cumulative+v |
+---+----------------------------+
| 5 |                          5 |
| 4 |                         13 |
| 3 |                          8 |
| 2 |                         15 |
| 1 |                          9 |
+---+----------------------------+

Clearly, in these case:

The sum is calculated in the "default" order of the table — the order without an explicit ORDER BY statement and, in this case, the order of insertion — and
It is being executed within the JOIN loops before the rest of the query somehow, circumventing the LIMIT restriction.

The second behaviour is particularly confusing, since the column from which the summation terms come from isn't even the one in the ON clause.

There are two factors that appear contribute to which behaviour is exhibited: combinations and types of indices (regular vs. PRIMARY), and type of JOIN. I have tested the combinations with significant results and have compiled them below:

^{** Strangely, using INNER JOIN q FORCE INDEX(val) makes the "KEY(q.val) AND KEY(t.q_val)" behaviour expected for some reason

* I suspect SELECT STRAIGHT_JOIN automatically creates the unexpected result in all cases, but STRAIGHT_JOIN-type JOIN seems to automatically create the expected result for all key combinations.}

This begs the question: why is this happening? Are there other cases that exist? And, since this "default" order is unpredictable and dangerous, can this behaviour be consistently avoided?

SQL Fiddle

Solution

Mixing a side-effect accumulator, an ORDER BY, and a LIMIT clause yield unpredictable results. That's because the ORDER BY operates on the resulting rows after the accumulator result set is generated. MySQL, and indeed all SQL table servers, return rows in formally unpredictable order in the absence of ORDER BY. See, SQL Fiddle has the same problem you showed. (http://sqlfiddle.com/#!9/44007/4/0)

You can control the result by generating the resultset in a predictable order in a subquery, then using the (nasty MySQL hack of a) side-effect accumulator. Like so. (http://sqlfiddle.com/#!9/44007/10/0)

SELECT v, @cumulative := @cumulative+v
FROM (
  SELECT t.v
    FROM t
    LEFT JOIN q ON t.q_val=q.val
   ORDER BY t.v DESC
) a
JOIN (SELECT @cumulative := 0) b
LIMIT 2

What is a side-effect accumulator?

The pattern in the query of this question involving @cumulative is the side-effect accumulator. It's called that because @cumulative := @cumulative+v declares the contents of a result-set column and also produces a side effect (incrementing the variable).

SQL is a declarative language at its heart. A query, in its platonic-ideal form, doesn't say how to generate the result, it says what to generate. In the real world, SQL is littered up with pragmas and hints and side-effects.

The problem with such things is that they often depend on the internal details of how the server plans and carries out queries. But, in a declarative language implemented by a competent query planner those details are intentionally obscure and unpredictable. The query planner is allowed (for efficiency's sake) to do anything it wants, in any order it wants, as long as it yields the correct result at the end. So it can carry out the side-effect computations however it pleases.

This questioner got caught by the obscurity of the sequence of row-generation and ordering operations, and the unpredictability of row ordering in results.

Notice that many of us would much rather have the MySQL and MariaDB teams spend their time implementing the ranking and windowing features available in other table servers than doing a lot of work to make these side-effect queries more predictable. Ranking and windowing would give us declarative ways to generate the sort of result this questioner wants.