Search code examples
sas

Replace values in a dataset with values in another dataset but only when a variable takes a specific value


suppose to have the following dataset:

data have;
  input ID :$20. Admission :date09. Discharge :date09. Index Variable1 Variable2; 
  format Admission date9. Discharge date9.;
cards;
0001 13JAN2015 20JAN2015 1  2  4  
0001 21FEB2015 31DEC2015 0  .  2
0001 21FEB2015 31DEC2015 0  3  .
0001 01JAN2019 31DEC2019 0  6  9
0002 01JAN2015 31DEC2015 0  2  .
0002 01JAN2019 31OCT2019 1  .  2
0002 01JAN2019 31OCT2019 0  .  2
;

Than, I have another dataset that looks like this and comes from other calculation:


data have1;
  input ID :$20. Admission :date09. Discharge :date09. Index Variable1 Variable2; 
  format Admission date9. Discharge date9.;
cards;
0001 13JAN2015 20JAN2015 1  6  8
0002 01JAN2019 31OCT2019 1  4  2
;

Is there a way to get the following?

data have2;
  input ID :$20. Admission :date09. Discharge :date09. Index Variable1 Variable2; 
  format Admission date9. Discharge date9.;
cards;
0001 13JAN2015 20JAN2015 1  6  8 
0001 21FEB2015 31DEC2015 0  .  .
0001 21FEB2015 31DEC2015 0  .  .
0001 01JAN2019 31DEC2019 0  .  .
0002 01JAN2015 31DEC2015 0  .  .
0002 01JAN2019 31OCT2019 1  4  2
0002 01JAN2019 31OCT2019 0  .  .
;

In other words I would like to replace values in have with values in have1 (where Index = 1). The remaining in have should be set to "." to have have3. The total number of variable: is 50. Here only 2 are shown for simplicity.

Thank you in advance


Solution

  • You can do this in two ways:

    1. A simple merge
    2. A hash join

    Let's take a look at both.

    Simple Merge

    We want to merge by the key, which is the combination of ID Admission Discharge Index. First we'll sort both datasets and then merge them by this key. If the combination of keys match, we'll set Variable1 and Variable2 to missing.

    proc sort data=have;
        by id admission discharge index;
    run;
    
    proc sort data=have1;
        by id admission discharge index;
    run;
    
    data want;
        merge have(in=h)
              have1(in=h1)
        ;
        by id admission discharge index;
    
        if(h NE h1) then call missing(of variable:);
    run;
    

    Hash Join

    This is very similar to a simple merge but it does not require sorting and has the highest performance. We'll treat have1 as a lookup table and pull down values from it for each match. If we do not find a match, we'll set the values of Variable1 and Variable2 to missing.

    data want;
        set have;
    
        if(_N_ = 1) then do;
            dcl hash h1(dataset:'have1');
                h1.defineKey('id', 'admission', 'discharge', 'index');
                h1.defineData('variable1', 'variable2');
            h1.defineDone();
        end;
    
        if(h1.Find() NE 0) then call missing(of variable:);
    run;
    
    ID      Admission   Discharge   Index   Variable1   Variable2
    0001    13JAN2015   20JAN2015   1       6           8
    0001    21FEB2015   31DEC2015   0       .           .
    0001    21FEB2015   31DEC2015   0       .           .
    0001    01JAN2019   31DEC2019   0       .           .
    0002    01JAN2015   31DEC2015   0       .           .
    0002    01JAN2019   31OCT2019   1       4           2
    0002    01JAN2019   31OCT2019   0       .           .