
How to make a join-table update efficient in SQL (it's efficient from my Ruby code!!)


This is something weird I can't really understand. I'm working in PostgreSQL 9.2 ...

I have this database:

movies (id, title, votes)
infos (id, movie_id, info_type, value)

I want to update movies.votes with infos.value, joining on movies.id = infos.movie_id and only where info_type = 100 (which is the type for votes..)

I tried 2 different queries:

update movies
set votes = cast(i.value as integer)
from movies m inner join infos i on m.id = i.movie_id
where i.info_type = 100

which (using EXPLAIN) predicts a running time of about 11 million seconds (too much!)

second try:

update movies m
set votes = cast(
(
  select value
  from infos i
  where i.info_type = 100 and i.movie_id = m.id
  limit 1
) AS integer);

this one would be "only" 20 thousand seconds... still far too much

I don't really know how the query planner works, so I tried to do this with a Ruby script (using ActiveRecord):

Info.find_in_batches(:conditions => "info_type = 100") do |group|
    group.each{ |info|
        movie = Movie.find(info.movie_id)
        movie.votes = info.value.to_i
        movie.save
    }
end

For those of you who don't read Ruby, this script simply loops through all infos rows that meet the info_type = 100 condition, then for each one it fetches the corresponding movie and updates it.

And it was very fast! Just a few minutes, even with all the Ruby/ORM overhead!!

Now, why?? For the record, movies has about 600k rows, but only 200k (a third) have an infos record with the number of votes. Still, this doesn't explain what is happening.


Solution

  • EXPLAIN

    As ruakh already explained, you probably misunderstood what EXPLAIN is telling you. If you want actual times in seconds, use EXPLAIN ANALYZE.

    Be aware though, that this actually executes the statement. I quote the manual here:

    Important: Keep in mind that the statement is actually executed when the ANALYZE option is used. Although EXPLAIN will discard any output that a SELECT would return, other side effects of the statement will happen as usual. If you wish to use EXPLAIN ANALYZE on an INSERT, UPDATE, DELETE, CREATE TABLE AS, or EXECUTE statement without letting the command affect your data, use this approach:

    BEGIN;
    EXPLAIN ANALYZE ...;
    ROLLBACK;
    
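    The BEGIN/ROLLBACK trick generalizes: a destructive statement inside an open transaction really executes, but can be undone. A minimal sketch using Python's stdlib sqlite3 module as a stand-in (EXPLAIN ANALYZE itself is Postgres-specific, so only the transaction part is shown; the tiny movies table here is hypothetical):

    ```python
    import sqlite3

    # Stand-in demo with SQLite: the UPDATE inside the transaction really
    # runs, but ROLLBACK undoes its effect -- the same safety net the
    # manual recommends wrapping around EXPLAIN ANALYZE on an UPDATE.
    conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
    conn.execute("CREATE TABLE movies (id INTEGER PRIMARY KEY, votes INTEGER)")
    conn.execute("INSERT INTO movies VALUES (1, 10)")

    conn.execute("BEGIN")
    conn.execute("UPDATE movies SET votes = 0")  # destructive statement runs...
    assert conn.execute("SELECT votes FROM movies").fetchone()[0] == 0
    conn.execute("ROLLBACK")                     # ...and is undone

    print(conn.execute("SELECT votes FROM movies").fetchone()[0])  # 10
    ```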

    Still, the estimates for the first query are way too high and indicate a grave problem.

    What's wrong?

    As to your third approach: for big tables, it will always be faster by an order of magnitude to have the database server update a whole (big) table at once than sending instructions to the server for every row - even more so if the new values come from within the database. More in this related answer. If your tests show otherwise, chances are that something is wrong with your (test) setup. And in fact, it is ...

    Your first query goes wrong completely. The god-awful performance estimate is an indication of how terribly wrong it is. While you join the table movies to the table infos in the FROM clause, you forget the WHERE condition binding the resulting rows to the rows of the UPDATE target. This leads to a CROSS JOIN, i.e. every row in movies (600k) is updated once per qualifying row in infos (200k), resulting in 120 000 000 000 updates. Yummy. And all wrong. Never execute this. Not even in a transaction that can be rolled back.
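    The unbound join can be seen in miniature. A hedged sketch with Python's sqlite3 (hypothetical six movies and two vote rows standing in for 600k and 200k); counting rows instead of updating them shows the multiplication:

    ```python
    import sqlite3

    # Tiny model of the bug: the first query joins movies to infos in FROM
    # but never binds that join back to the movies row being updated, so
    # the planner sees a cross product with the update target.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE movies (id INTEGER PRIMARY KEY, votes INTEGER);
        CREATE TABLE infos  (id INTEGER PRIMARY KEY, movie_id INTEGER,
                             info_type INTEGER, value TEXT);
        INSERT INTO movies (id) VALUES (1), (2), (3), (4), (5), (6);
        INSERT INTO infos VALUES (1, 1, 100, '7'), (2, 2, 100, '9');
    """)

    # movies (the update target) cross-joined with the m/i join result:
    unbound = conn.execute(
        "SELECT count(*) FROM movies, movies m JOIN infos i ON m.id = i.movie_id "
        "WHERE i.info_type = 100").fetchone()[0]
    # with the binding condition, only matching rows remain:
    bound = conn.execute(
        "SELECT count(*) FROM movies m JOIN infos i ON m.id = i.movie_id "
        "WHERE i.info_type = 100").fetchone()[0]

    print(unbound, bound)  # 12 2 -- scale 6 and 2 up to 600k and 200k...
    ```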

    Your second query goes wrong, too. It runs a correlated subquery, i.e. a separate query for every row. That's 600k subqueries instead of just one, hence the terrible performance.

    That's right: 600k subqueries, not 200k. You instruct Postgres to update every movie, no matter what. Those without a matching infos row (no info_type = 100) receive a NULL value in votes, overwriting whatever was there before.
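    A small sketch of that overwrite, again with sqlite3 as a stand-in and two hypothetical movies, one of which has no vote row:

    ```python
    import sqlite3

    # The unguarded correlated subquery runs for every movie; a movie with
    # no info_type = 100 row gets NULL, clobbering its old votes value.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE movies (id INTEGER PRIMARY KEY, votes INTEGER);
        CREATE TABLE infos  (id INTEGER PRIMARY KEY, movie_id INTEGER,
                             info_type INTEGER, value TEXT);
        INSERT INTO movies VALUES (1, 5), (2, 8);    -- both have old votes
        INSERT INTO infos VALUES (1, 1, 100, '42');  -- only movie 1 matches
    """)
    conn.execute("""
        UPDATE movies SET votes = CAST(
            (SELECT value FROM infos i
             WHERE i.info_type = 100 AND i.movie_id = movies.id) AS INTEGER)
    """)
    rows = conn.execute("SELECT id, votes FROM movies ORDER BY id").fetchall()
    print(rows)  # [(1, 42), (2, None)] -- movie 2's previous 8 is gone
    ```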

    Also, I wonder what that LIMIT 1 is doing there?

    • Either (infos.movie_id, infos.info_type) is UNIQUE, then you don't need LIMIT.
    • Or it isn't UNIQUE. Then add a UNIQUE index to infos if you intend to keep the structure.

    Proper query

    UPDATE movies m
    SET    votes = i.value::int
    FROM   infos i
    WHERE  m.id = i.movie_id
    AND    i.info_type = 100
    AND    m.votes IS DISTINCT FROM i.value::int;
    
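    A runnable analogue of this query, sketched with Python's sqlite3 (UPDATE ... FROM needs SQLite >= 3.33; SQLite's null-safe `IS NOT` stands in for Postgres's IS DISTINCT FROM; the three sample rows are hypothetical):

    ```python
    import sqlite3

    # One set-based statement: bind each movies row to its matching infos
    # row, convert the value, and skip rows that would not change.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE movies (id INTEGER PRIMARY KEY, votes INTEGER);
        CREATE TABLE infos  (id INTEGER PRIMARY KEY, movie_id INTEGER,
                             info_type INTEGER, value TEXT);
        INSERT INTO movies VALUES (1, NULL), (2, 8), (3, NULL);
        INSERT INTO infos VALUES (1, 1, 100, '42'), (2, 2, 101, '9');
    """)
    conn.execute("""
        UPDATE movies SET votes = CAST(i.value AS INTEGER)
        FROM   infos i
        WHERE  movies.id = i.movie_id
        AND    i.info_type = 100
        AND    movies.votes IS NOT CAST(i.value AS INTEGER)
    """)
    rows = conn.execute("SELECT id, votes FROM movies ORDER BY id").fetchall()
    print(rows)  # [(1, 42), (2, 8), (3, None)] -- only the matching row changes
    ```

    Movie 2 keeps its old value because its only infos row has the wrong info_type, and movie 3 is left untouched rather than nulled out.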

    This is much like your first query, just simplified and doing it right. Plus:

    • No need to join to movies a second time. You only need infos in the FROM clause.

    • Actually bind the row to be updated to the row carrying the new value, thereby avoiding the (unintended) CROSS JOIN:

        WHERE  m.id = i.movie_id
      
    • Avoid empty updates; they carry a cost for no gain. That's what the last line is for.
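    The effect of that last predicate can be sketched with sqlite3 and a single hypothetical row (`IS NOT` again playing the role of IS DISTINCT FROM; `rowcount` reports rows actually modified): rerunning the update finds nothing distinct to change, so no row is rewritten the second time.

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE movies (id INTEGER PRIMARY KEY, votes INTEGER);
        CREATE TABLE infos  (id INTEGER PRIMARY KEY, movie_id INTEGER,
                             info_type INTEGER, value TEXT);
        INSERT INTO movies VALUES (1, NULL);
        INSERT INTO infos VALUES (1, 1, 100, '42');
    """)
    sql = """
        UPDATE movies SET votes = CAST(i.value AS INTEGER)
        FROM   infos i
        WHERE  movies.id = i.movie_id AND i.info_type = 100
        AND    movies.votes IS NOT CAST(i.value AS INTEGER)
    """
    first = conn.execute(sql).rowcount   # 1: the value actually changes
    second = conn.execute(sql).rowcount  # 0: the empty update is skipped
    print(first, second)  # 1 0
    ```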

    Should be a matter of seconds or less, not millions of seconds.

    BTW, indexes will not help this query; table scans are faster for the described data distribution, since you use all (or a third) of the rows in the involved tables.