Search code examples
sasleft-joinhashtable

Left join using HASH with missing values to be replaced with latest available value in SAS


I will have to join the following two tables.

Table one (1 billion rows)

input snapshotdate sourcekey sourcesystemid value1;
datalines;
20200101 112 5 788
20200102 112 5 789
20200103 112 5 800
20200201 112 5 786
20200202 112 5 777
20200203 112 5 834
20200301 112 5 789
20200302 112 5 771
20200303 112 5 832
20200101 222 6 788
20200102 222 6 789
20200103 222 6 800
20200201 222 6 786
20200202 222 6 777
20200203 222 6 834
20200301 222 6 789
20200302 222 6 771
20200303 222 6 832
;
run;

Table Two (32 million rows)

Data two;
input period sourcekey sourcesystemid npl;
datalines;
202001 112 5 999
202002 112 5 988
202001 222 6 555
202002 222 6 556
;
run;

I would like to get the joined table as below;

snapshotdate sourcekey sourcesystemid value1  NPL
20200101      112         5            788    999
20200102      112         5            789    999
20200103      112         5            800    999
20200201      112         5            786    988
20200202      112         5            777    988
20200203      112         5            834    988
20200301      112         5            789    988
20200302      112         5            771    988
20200303      112         5            832    988
20200101      222         6            788    555
20200102      222         6            789    555
20200103      222         6            800    555
20200201      222         6            786    556
20200202      222         6            777    556
20200203      222         6            834    556
20200301      222         6            789    556
20200302      222         6            771    556
20200303      222         6            832    556

When there is a missing (year and month), it has to be filled with the latest available value, The code I have currently: (not working to replace the missing values).

Proc SQL;
Create Table want as
Select 
a.*,
b.npl
from one as a
left join two as b
on a.sourcekey =b.sourcekey and a.sourcesystemid = b.sourcesystemid and input(substr(put(a.snapshotdate,8.),1,6),6.) = b.period
order by a.sourcekey,a.snapshotdate
;
Quit;

Since it involves large tables, I prefer this to be coded using HASH. I would like to use table two as HASH object.

Thanks in advance.


Solution

  • If I understand your question correctly, and you have sufficient memory, you can do this

    data want(drop=period rc);
       if _N_ = 1 then do;
          dcl hash h(dataset : "two");
          h.definekey("sourcekey", "sourcesystemid", "period");
          h.definedata("npl");
          h.definedone();
       end;
    
       set one;
       if 0 then set two;
    
       rc = h.find(key : sourcekey, key : sourcesystemid, key : int(snapshotdate/100));
    run;