Search code examples
sqldb2memory-efficient

Efficient way to update db2 database rows


I have a table with 92 million rows. I have a list of 4000 IDs from that table which need data updating. I put the 4000 IDs into their own table and tried running the following:

update clients
set col1='1', col2='y'
where id in
(select id from idstoupdate)

But this falls over due to memory constraints. So I tried splitting the 4000 IDs into 4 table each with 1000, and its still falling over if I try it on those smaller tables. Whats the most efficient way to deal with such a large table?

Thanks.


Solution

  • While there may be ways to update a table with another table through fancy subselects, I believe the best approach is to write a program to do this using the SQL API (whether it is DBI with the DBD::DB2 driver for perl, JDBC for Java, or the C libraries, etc) to perform the SELECT, FETCH each result row from the RESULTSET with a cursor, and do an update.

    PSEUDOCODE (I don't know what language you are familiar with):

    dbHandle = sqllib->open_connection(database, user, password) 
    select_statement = dbHandle->prepare('SELECT id FROM idstoupdate')
    update_statement = dbHandle->prepare('UPDATE clients SET col1=?, col2=? WHERE ID=?')
    resultset = statement->execute(select_statement)
    
    foreach (row in resultset) {
      id = row.getColumn('id')
      update_statement->execute('1','2',id) 
    }
    
    dbHandle->disconnect();  
    

    You would want to add error checking. If you want either all the updates to apply, or none, then you have to look into beginning a transaction and commiting the entire transaction if you have no errors. There is a wealth of material on how to do all the above in the DB2 Infocenter.

    Note: If your source data for the idstoupdate is a file, then you could skip the select statement and the work you do to load the idstoupdate table, and just read from the file and update the database. This would be the most efficient way to handle updates to a table.

    If you simply must update a table from another table with pure SQL, then the most common examples are in this format:

          UPDATE table1 t1
             SET (t1.field1, t1.field2) = 
                 (
                   SELECT t2.field1, 
                          t2.field2
                     FROM table2 t2
                    WHERE t1.joinfield = t2.joinfield 
                      AND t2.criteriafield = 'qualifier'
                 )
           WHERE EXISTS 
                 ( 
                   SELECT 1 
                     FROM table2
                    WHERE t1.joinfield = table2.joinfield 
                      AND t2.criteriafield = 'qualifier'
                 )    
    

    which eliminates the IN predicate, but is probably not much more memory or logspace efficient, and because it is less straightforward than a select, loop, fetch, update you need to be sure you have all the criteria correct. Your case is a bit simpler-- I think this would work, but I'd need a db2 instance to try it against:

          UPDATE clients t1
             SET t1.col1 = '1', col2 = 'y' 
           WHERE EXISTS 
                 ( 
                   SELECT 1 
                     FROM idstoupdate t2
                    WHERE t1.id = t2.id 
                 )  
    

    Edit: I am actually surprised that the query you gave did not work from SQL squirrel as it is a legitimate query. It may even perform similar to the example I gave, as DB2 is very good at optimizing SQL/determining the best access path.

    In my answer, I was trying to show the most memory efficient way to update a table, as well as the general pattern for updating rows in one table from another table using pure SQL (which would cover cases where one table contains more than just the rows you want to update).

    Additionally, I am suspicious of IN predicates that contain more than 20 or so values, even if modern database engines handle them with ease.

    However, the best way to examine if the database engine is handling your query efficiently and/or compare two SQL queries is to use the SQL explain commands.