Search code examples
mergesasmany-to-manyretain

How do you set retained values to missing if there's no match in a many-to-many SAS merge ?


I have two datasets that I want to merge together that have repeated by variables, but an unequal number of rows. In SAS, the default behavior is to retain values for all the rows that don't have a match.

For example:

data a;
    input i x y;
    datalines;
        1 1 5
        1 2 6
        1 3 7
        1 4 8
        ;
run;

data b;
    input i f g $;
    datalines;
        1 9 aa
        1 8 bb
        ;
run;

Here dataset a has four rows of the by variable i, while dataset b only has two.

Merging solely with the by variable i produces this:

data c;
    merge a b;
    by i;
run; 

Obs    i    x    y    f    g
1     1    1    5    9    aa
2     1    2    6    8    bb
3     1    3    7    8    bb
4     1    4    8    8    bb

You can see that for variables f and g in obs 3 and 4 the values have been retained since they didn't have a match in dataset a.

What I am trying to produce is this output:

Obs    i    x    y    f    g
1     1    1    5    9    aa
2     1    2    6    8    bb
3     1    3    7    .    
4     1    4    8    .    

I am using SAS 9.4 and this is what I've tried:

data c;
    if _n_>1 then do;
        array num{*} _numeric_;
        array char{*} _character_;
        call missing(of num{*});
        call missing(of char{*});
        end;
    merge a b;
    by i;
run;

My thinking is that for every row after the first, I want to set all variables to missing and so that if they don't have a matching row their values won't be overwritten and will remain missing. This would eliminate retained values.

By the second row the PDV should be created and all the metadata should be available to create these arrays and set them to missing, but I am getting this error:

WARNING: Defining an array with zero elements.

Any suggestions on how to fix this code or other code that would do the trick?


Solution

  • You would want to override the default behavior of the run statement, namely the automatic output and automatic call missing of certain variables.

    Here you output; to force the automatic output (same as default behavior) and then call missing(of _all_); which sets all variables to missing (as opposed to only ones not appearing on the merge or set statements).

    data c;
        merge a b;
        by i;
        output;
        call missing(of _all_);
    run; 
    

    The reason you have to do it at the end and not the beginning is that you haven't defined any variables yet at the beginning - so _numeric_, _character_, or _all_ don't have anything to refer to.

    You can fix this, with an if 0 then set a b;, but I find the above solution a bit more straightforward. Really either works fine and has the same speed and benefit.

    data c;
        if 0 then set a b; *defines all of the variables, but `if 0` means it will not pull any data;
        call missing(of _all_); *sets everything missing;
        merge a b;
        by i;
    run;