I have a SAS Master Dataset with 10 Million Rows and 1800 columns. I need to update 10 columns using a transaction dataset with 15 million records, only for records with matching key. I tried running a proc sql update statement using the following code.
proc sql;
UPDATE lib1.master1 a
SET col1 = (SELECT col1 FROM lib1.changes_1 b WHERE a.key=b.key),
col2 = (SELECT col2 FROM lib1.changes_1 b WHERE a.key=b.key),
col3 = (SELECT col3 FROM lib1.changes_1 b WHERE a.key=b.key),
col4 = (SELECT col4 FROM lib1.changes_1 b WHERE a.key=b.key),
col5 = (SELECT col5 FROM lib1.changes_1 b WHERE a.key=b.key),
col6 = (SELECT col6 FROM lib1.changes_1 b WHERE a.key=b.key),
col7 = (SELECT col7 FROM lib1.changes_1 b WHERE a.key=b.key),
col8 = (SELECT col8 FROM lib1.changes_1 b WHERE a.key=b.key),
col9 = (SELECT col9 FROM lib1.changes_1 b WHERE a.key=b.key)
WHERE EXISTS ( SELECT 1 FROM lib1.changes_1 b WHERE A.key = B.key);
quit;
I tried for just col1 for testing purpose and its been running for more than 4 hours.
I can think of data merge, by dropping the 10 columns and then a left join but that would change the column ordering. Reordering 1800 columns would again be a tedious task.
Is there a faster/more efficient technique?
To replace one column, a format (roughly similar to Bendy's approach) is easiest.
To replace ten columns, always coming from the same row, I recommend a hash table. Around the same speed as a single format, typically. (Formats actually can be a bit slow at the 10MM rows mark, so this might be purely faster even than one.)
This took ~30 seconds on my laptop (CPU time, and real time; I have an SSD, so those are similar. On a HDD this is probably 30 seconds CPU time and a few minutes real time.)
*make some dummy data;
data maindata;
array col(10);
do _i = 1 to dim(col);
col[_i] = _i;
end;
do key = 1 to 1e7;
output;
end;
run;
data updatedata;
array col(10);
do _i = 1 to dim(col);
col[_i] = 100+_i;
end;
do key = 1 to 3e7 by 2;
output;
end;
run;
*using MODIFY here which is identical in function to SQL UPDATE - does not make a new dataset;
data maindata;
if _n_=1 then do;
declare hash ud(dataset:'updatedata');
ud.defineKey('key');
ud.defineData('col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8', 'col9', 'col10');
ud.defineDone();
end;
modify maindata;
rcUpdate = ud.find();
if rcUpdate=0 then replace;
run;