I will have to join the following two tables.
Table one (1 billion rows)
input snapshotdate sourcekey sourcesystemid value1;
datalines;
20200101 112 5 788
20200102 112 5 789
20200103 112 5 800
20200201 112 5 786
20200202 112 5 777
20200203 112 5 834
20200301 112 5 789
20200302 112 5 771
20200303 112 5 832
20200101 222 6 788
20200102 222 6 789
20200103 222 6 800
20200201 222 6 786
20200202 222 6 777
20200203 222 6 834
20200301 222 6 789
20200302 222 6 771
20200303 222 6 832
;
run;
Table Two (32 million rows)
Data two;
input period sourcekey sourcesystemid npl;
datalines;
202001 112 5 999
202002 112 5 988
202001 222 6 555
202002 222 6 556
;
run;
I would like to get the joined table as below;
snapshotdate sourcekey sourcesystemid value1 NPL
20200101 112 5 788 999
20200102 112 5 789 999
20200103 112 5 800 999
20200201 112 5 786 988
20200202 112 5 777 988
20200203 112 5 834 988
20200301 112 5 789 988
20200302 112 5 771 988
20200303 112 5 832 988
20200101 222 6 788 555
20200102 222 6 789 555
20200103 222 6 800 555
20200201 222 6 786 556
20200202 222 6 777 556
20200203 222 6 834 556
20200301 222 6 789 556
20200302 222 6 771 556
20200303 222 6 832 556
When there is a missing (year and month), it has to be filled with the latest available value, The code I have currently: (not working to replace the missing values).
Proc SQL;
Create Table want as
Select
a.*,
b.npl
from one as a
left join two as b
on a.sourcekey =b.sourcekey and a.sourcesystemid = b.sourcesystemid and input(substr(put(a.snapshotdate,8.),1,6),6.) = b.period
order by a.sourcekey,a.snapshotdate
;
Quit;
Since it involves large tables, I prefer this to be coded using HASH. I would like to use table two as HASH object.
Thanks in advance.
If I understand your question correctly, and you have sufficient memory, you can do this
data want(drop=period rc);
if _N_ = 1 then do;
dcl hash h(dataset : "two");
h.definekey("sourcekey", "sourcesystemid", "period");
h.definedata("npl");
h.definedone();
end;
set one;
if 0 then set two;
rc = h.find(key : sourcekey, key : sourcesystemid, key : int(snapshotdate/100));
run;