A query executes concurrently (from multiple sessions) to atomically increment some counters by aggregated numeric values:
WITH data (id, delta) AS (
   VALUES (1, 1.2), (2, 2.0), (2, 0.5)
)
, agg_data AS (
   SELECT id::bigint, sum(delta::numeric) AS sum_delta
   FROM   data
   GROUP  BY id
   ORDER  BY id
)
UPDATE counters
SET    val = counters.val + agg_data.sum_delta
FROM   agg_data
WHERE  counters.id = agg_data.id;
The id column in the CTE agg_data is ordered, but the UPDATE statement does not follow that order and updates rows in counters in arbitrary order, which causes deadlocks. Unfortunately, ORDER BY is not supported in UPDATE (I wonder whether there is a reason for that).
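For illustration, here is a minimal interleaving that reproduces the deadlock (the session labels and delta values are hypothetical; the point is only that two concurrent runs reach the same two rows in opposite order):

```sql
-- Session A reaches id 1 first:
BEGIN;
UPDATE counters SET val = val + 1.2 WHERE id = 1;  -- locks row 1

-- Session B happens to reach id 2 first:
BEGIN;
UPDATE counters SET val = val + 2.5 WHERE id = 2;  -- locks row 2
UPDATE counters SET val = val + 1.0 WHERE id = 1;  -- blocks, waiting on session A

-- Session A continues:
UPDATE counters SET val = val + 2.0 WHERE id = 2;  -- blocks, waiting on session B
-- ERROR:  deadlock detected
-- (Postgres aborts one of the two transactions)
```

Locking all affected rows in a consistent order (e.g. by id) before updating prevents this cycle.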
While I could add a subquery that does SELECT FOR UPDATE, it feels like I'm overcomplicating things:
WITH data (id, delta) AS (
   VALUES (1, 1.2), (2, 2.0), (2, 0.5)
)
, agg_data AS (
   SELECT id::bigint, sum(delta::numeric) AS sum_delta
   FROM   data
   GROUP  BY id
   ORDER  BY id
)
, locked_counters AS (  -- add this part to lock rows in sequential order
   SELECT FROM counters
   WHERE  id IN (SELECT id FROM agg_data)
   ORDER  BY id
   FOR NO KEY UPDATE
)
UPDATE counters
SET    val = counters.val + agg_data.sum_delta
FROM   agg_data
WHERE  counters.id = agg_data.id;
Considering that my counters table could be very big and this query needs to be fully optimized for efficiency and integrity - is there a better way to go about this?
Sadly, there is no ORDER BY for UPDATE. We have to take explicit row-level locks with one of the SELECT FOR UPDATE family of tools. And FOR NO KEY UPDATE after ORDER BY in a separate CTE seems like the best choice. You are on the right track.
Since ...

this query needs to be fully optimized for efficiency and integrity

... I have a couple of suggestions:
WITH data (id, delta) AS (
   VALUES (null::bigint, null::numeric)  -- ①
   UNION ALL
   VALUES (1, 1.2), (2, 2.0), (2, 0.5)
)
, agg_data AS (  -- ②
   SELECT d.*
   FROM  (
      SELECT id, sum(delta) AS sum_delta
      FROM   data
      GROUP  BY id
      ORDER  BY id  -- optional
      ) d
   JOIN   counters USING (id)  -- ③
   ORDER  BY id
   FOR NO KEY UPDATE OF counters  -- ②
)
UPDATE counters c
SET    val = c.val + a.sum_delta
FROM   agg_data a
WHERE  c.id = a.id;
① Don't let input values default to some data type only to cast them later. That's wasted effort (with the potential of introducing rounding errors). Instead, cast input to the right type immediately. You could do that by casting the values of the first (or any) row in the VALUES expression:
VALUES (bigint '1', numeric '1.2'), (2, 2.0), (2, 0.5)
That may be inconvenient as it forces you to edit your input. The next best thing is something like what I did above: an immutable, leading row of null values of the right data types, forcing the rest of the set to fall in line. The dummy row is eliminated automatically in the subsequent join.
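To see that the leading null row really dictates the column types, a quick check with pg_typeof() (a standalone sketch, not part of the query above):

```sql
SELECT pg_typeof(id)    AS id_type
     , pg_typeof(delta) AS delta_type
FROM  (
   VALUES (null::bigint, null::numeric)
   UNION ALL
   VALUES (1, 1.2)
   ) data(id, delta)
LIMIT  1;
-- id_type | delta_type
-- bigint  | numeric
```

Without the leading row, id would resolve to integer and delta to numeric by Postgres' type-resolution rules for VALUES lists.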
② You had one CTE agg_data to materialize the result of the aggregate, and another CTE locked_counters to take row locks in order. You can do both in a single CTE. However, quoting the manual:
Currently, FOR NO KEY UPDATE, FOR UPDATE, FOR SHARE and FOR KEY SHARE cannot be specified with GROUP BY.
Postgres would raise an exception before it realizes that the locks are not applicable to that derived table anyway. A shortcoming, but we can work around it with FOR NO KEY UPDATE OF counters.
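For reference, trying to take the lock directly on the aggregating query fails up front; on recent Postgres versions the error looks roughly like this (exact wording may vary by version):

```sql
SELECT id, sum(delta) AS sum_delta
FROM   data
GROUP  BY id
FOR NO KEY UPDATE;
-- ERROR:  FOR NO KEY UPDATE is not allowed with GROUP BY clause
```

Locking only the joined table with FOR NO KEY UPDATE OF counters sidesteps the check, since the lock then never targets the aggregated derived table.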
③ A JOIN is typically faster than IN (SELECT ...). IN tries to fold duplicates from the subquery. After GROUP BY id there cannot be any dupes, but Postgres does not know that and still follows the more expensive code path.
Also, the JOIN immediately removes input rows without a match in the target table, which can be a notable gain if there are many.
In theory this should be a bit faster. You'll have to test.