This is something weird I can't really understand. I'm working with PostgreSQL 9.2 ...
I have this database:
movies (id, title, votes)
infos (id, movie_id, info_type, value)
I want to update movies.votes with infos.value, joining on movies.id = infos.movie_id and only where info_type = 100 (which is the type for votes).
I tried 2 different queries:
update movies
set votes = cast(i.value as integer)
from movies m inner join infos i on m.id = i.movie_id
where i.info_type = 100
which (according to EXPLAIN) predicts a running time of about 11 million seconds (far too much!)
second try:
update movies m
set votes = cast(
(
select value
from infos i
where i.info_type = 100 and i.movie_id = m.id
limit 1
) AS integer);
this one would be "only" 20 thousand seconds... still far too much
I don't really know how the query planner works, so I tried to do this with a Ruby script (using ActiveRecord):
Info.find_in_batches(:conditions => "info_type = 100") do |group|
group.each{ |info|
movie = Movie.find(info.movie_id)
movie.votes = info.value.to_i
movie.save
}
end
For those of you who don't read Ruby: this script simply loops through all infos rows that meet the info_type = 100 condition, then for each one it looks up the corresponding movie and updates it.
And it was very fast! Just a few minutes, and with all the Ruby/ORM overhead!
Now, why? Note that movies has about 600k records, but only 200k (a third) have an info record with the number of votes. Still, this doesn't explain what is happening.
EXPLAIN

As ruakh already explained, you probably misunderstood what EXPLAIN is telling you: its numbers are cost estimates in arbitrary units, not seconds. If you want actual execution times, use EXPLAIN ANALYZE.
Be aware though, that this actually executes the statement. I quote the manual here:
Important: Keep in mind that the statement is actually executed when the ANALYZE option is used. Although EXPLAIN will discard any output that a SELECT would return, other side effects of the statement will happen as usual. If you wish to use EXPLAIN ANALYZE on an INSERT, UPDATE, DELETE, CREATE TABLE AS, or EXECUTE statement without letting the command affect your data, use this approach:

BEGIN;
EXPLAIN ANALYZE ...;
ROLLBACK;
Still, the estimates for the first query are way too high and indicate a grave problem.
As to your third approach: for big tables, it will always be faster by an order of magnitude to have the database server update a whole (big) table at once than sending instructions to the server for every row - even more so if the new values come from within the database. More in this related answer. If your tests show otherwise, chances are that something is wrong with your (test) setup. And in fact, it is ...
Your first query goes wrong completely. The god-awful performance estimate is an indication of how terribly wrong it is. While you join the table movies to the table infos in the FROM clause, you forget the WHERE condition binding the resulting rows to the rows of the table being updated. This leads to a CROSS JOIN, i.e. every row in movies (600k) is updated with every single vote in infos (200k), resulting in 120 000 000 000 updates. Yummy. And all wrong. Never execute this. Not even in a transaction that can be rolled back.
Your second query goes wrong, too. It runs a correlated subquery, i.e. a separate query for every row. That's 600k subqueries instead of just one, hence the terrible performance.

That's right: 600k subqueries, not 200k. You instruct Postgres to update every movie, no matter what. Those without a matching infos.value (no row with info_type = 100) receive a NULL value in votes, overwriting whatever was there before.
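The NULL overwrite is easy to reproduce. A small SQLite sketch (again a stand-in for Postgres, with made-up data) running the same correlated-subquery shape:

```python
import sqlite3

# Two movies with existing vote counts; only movie 1 has a votes info row.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, votes INTEGER);
    CREATE TABLE infos  (movie_id INTEGER, info_type INTEGER, value TEXT);
    INSERT INTO movies VALUES (1, 5), (2, 7);
    INSERT INTO infos  VALUES (1, 100, '42');
""")

# Like the second query: every movie is updated, whether it matches or not.
con.execute("""
    UPDATE movies
    SET    votes = (SELECT CAST(value AS INTEGER)
                    FROM   infos i
                    WHERE  i.info_type = 100 AND i.movie_id = movies.id)
""")

print(con.execute("SELECT id, votes FROM movies ORDER BY id").fetchall())
# [(1, 42), (2, None)] -- movie 2's old count of 7 was wiped to NULL
```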
Also, I wonder what that LIMIT 1 is doing there?
If (infos.movie_id, infos.info_type) is UNIQUE, then you don't need LIMIT at all. If it isn't, the LIMIT just picks an arbitrary matching row, which points to a design problem; add a UNIQUE index to infos if you intend to keep the structure.

UPDATE movies m
SET votes = i.value::int
FROM infos i
WHERE m.id = i.movie_id
AND i.info_type = 100
AND m.votes IS DISTINCT FROM i.value::int;
This is much like your first query, just simplified and doing it right. Plus:
No need to join movies a second time; you only need infos in the FROM clause.
Actually bind the row to be updated to the row carrying the new value, thereby avoiding the (unintended) CROSS JOIN:
WHERE m.id = i.movie_id
Avoid empty updates; they carry a cost for no gain. That's what the last line (IS DISTINCT FROM) is for.
Should be a matter of seconds or less, not millions of seconds.
BTW, indexes will not help this query; sequential scans are faster for the described data distribution, since you use all (or a third) of the rows of the involved tables.
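To check that the fixed query only touches matching rows, here is one last SQLite sketch (made-up data). SQLite only gained UPDATE ... FROM in version 3.33, so this uses the equivalent correlated form with an EXISTS guard; the point it demonstrates is the same: movies without a votes row are left alone instead of being nulled out.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, votes INTEGER);
    CREATE TABLE infos  (movie_id INTEGER, info_type INTEGER, value TEXT);
    INSERT INTO movies VALUES (1, 5), (2, 7);
    INSERT INTO infos  VALUES (1, 100, '42');
""")

# Equivalent of the fixed UPDATE: the EXISTS guard plays the role of the
# WHERE m.id = i.movie_id binding, restricting the update to matching rows.
cur = con.execute("""
    UPDATE movies
    SET    votes = (SELECT CAST(value AS INTEGER)
                    FROM   infos i
                    WHERE  i.info_type = 100 AND i.movie_id = movies.id)
    WHERE  EXISTS (SELECT 1
                   FROM   infos i
                   WHERE  i.info_type = 100 AND i.movie_id = movies.id)
""")

print(cur.rowcount)  # 1 -- only the movie that has a votes row was updated
print(con.execute("SELECT id, votes FROM movies ORDER BY id").fetchall())
# [(1, 42), (2, 7)] -- movie 2 keeps its old count
```

In Postgres, the IS DISTINCT FROM predicate in the answer's query goes one step further and also skips rows whose votes already hold the new value.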