Search code examples
sqlperformancesaslarge-data

PROC SQL Update Efficiency for Large Datasets


I have a SAS Master Dataset with 10 Million Rows and 1800 columns. I need to update 10 columns using a transaction dataset with 15 million records, only for records with matching key. I tried running a proc sql update statement using the following code.

proc sql;
UPDATE lib1.master1 a

SET col1 = (SELECT col1 FROM lib1.changes_1 b WHERE a.key=b.key),
    col2 = (SELECT col2 FROM lib1.changes_1 b WHERE a.key=b.key),
    col3 = (SELECT col3 FROM lib1.changes_1 b WHERE a.key=b.key),
    col4 = (SELECT col4 FROM lib1.changes_1 b WHERE a.key=b.key),
    col5 = (SELECT col5 FROM lib1.changes_1 b WHERE a.key=b.key),
    col6 = (SELECT col6 FROM lib1.changes_1 b WHERE a.key=b.key),
    col7 = (SELECT col7 FROM lib1.changes_1 b WHERE a.key=b.key),
    col8 = (SELECT col8 FROM lib1.changes_1 b WHERE a.key=b.key),
    col9 = (SELECT col9 FROM lib1.changes_1 b WHERE a.key=b.key)

WHERE EXISTS ( SELECT 1 FROM lib1.changes_1 b WHERE A.key = B.key);
quit;

I tried for just col1 for testing purpose and its been running for more than 4 hours.

I can think of data merge, by dropping the 10 columns and then a left join but that would change the column ordering. Reordering 1800 columns would again be a tedious task.

Is there a faster/more efficient technique?


Solution

  • To replace one column, a format (roughly similar to Bendy's approach) is easiest.

    To replace ten columns, always coming from the same row, I recommend a hash table. Around the same speed as a single format, typically. (Formats actually can be a bit slow at the 10MM rows mark, so this might be purely faster even than one.)

    This took ~30 seconds on my laptop (CPU time, and real time; I have an SSD, so those are similar. On a HDD this is probably 30 seconds CPU time and a few minutes real time.)

    *make some dummy data;
    data maindata;
      array col(10);
      do _i = 1 to dim(col);
        col[_i] = _i;
      end;
      do key = 1 to 1e7;
        output;
      end;
    run;
    data updatedata;
      array col(10);
      do _i = 1 to dim(col);
        col[_i] = 100+_i;
      end;
      do key = 1 to 3e7 by 2;
        output;
      end;
    run;
    
    *using MODIFY here which is identical in function to SQL UPDATE - does not make a new dataset;
    data maindata;  
      if _n_=1 then do;
        declare hash ud(dataset:'updatedata');
        ud.defineKey('key');
        ud.defineData('col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8', 'col9', 'col10');
        ud.defineDone();
      end;
      modify maindata;
      rcUpdate = ud.find();
      if rcUpdate=0 then replace;
    run;