Search code examples
sqlarrayssascoalesce

SAS: most efficient method to output first non-missing across multiple columns


The data I have are millions of rows and rather sparse with anywhere between 3 and 10 variables needing processed. My end result needs to be one single row containing the first non-missing value for each column. Take the following test data:

** test data **;
data test;
    length ID $5 AID 8 TYPE $5;
    input ID $ AID TYPE $;
    datalines;
    A   .   .
    .   123 .
    C   .   XYZ
    ;
run;

The end result should look like such:

ID  AID TYPE
A   123 XYZ

Using macro lists and loops I can brute force this result with multiple merge statements where the variable is non-missing and obs=1 but this is not efficient when the data are very large (below I'd loop over these variables rather than write multiple merge statements):

** works but takes too long on big data **;
data one_row;
    merge 
        test(keep=ID where=(ID ne "") obs=1) /* character */
        test(keep=AID where=(AID ne .) obs=1) /* numeric */
        test(keep=TYPE where=(TYPE ne "") obs=1); /* character */
run;

The coalesce function seems very promising, but I believe I need it in combination with array and output to build this single-row result. The function also differs (coalesce and coalescec depending on variable type) whereas it does not matter using proc sql. I get an error using array since all variables in the array list are not the same type.


Solution

  • Exactly what is most efficient will largely depend on the characteristics of your data. In particular, whether the first nonmissing value for the last variable is usually relatively "early" in the dataset, or if you usually will have to trawl through the entire dataset to get to it.

    I assume your dataset is not indexed (as that would simplify things greatly).

    One option is the standard data step. This isn't necessarily fast, but it's probably not too much slower than most other options given you're going to have to read most/all of the rows no matter what you do. This has a nice advantage that it can stop when every row is complete.

    data want;
      if 0 then set test; *defines characteristics;
      set test(rename=(id=_id aid=_aid type=_type)) end=eof;  
      id=coalescec(id,_id);
      aid=coalesce(aid,_aid);
      type=coalescec(type,_type);
      if cmiss(of id aid type)=0 then do;
        output;
        stop;
      end;
      else if eof then output;
      drop _:;
    run;
    

    You could populate all of that from macro variables from dictionary.columns, or even might use temporary arrays, though I think that gets too messy.

    Another option is the self update, except it needs two changes. One, you need something to join on (as opposed to merge which can have no by variable). Two, it will give you the last nonmissing value, not the first, so you'd have to reverse-sort the dataset.

    But assuming you added x to the first dataset, with any value (doesn't matter, but constant for every row), it is this simple:

    data want;
      update test(obs=0) test;
      by x;
    run;
    

    So that has the huge advantage of simplicity of code, exchanged for some cost of time (reverse sorting and adding a new variable).

    If your dataset is very sparse, a transpose might be a good compromise. Doesn't require knowing the variable names as you can process them with arrays.

    data test_t;
      set test;
      array numvars _numeric_;
      array charvars _character_;
      do _i = 1 to dim(numvars);
        if not missing(numvars[_i]) then do;
          varname = vname(numvars[_i]);
          numvalue= numvars[_i];
          output;
        end;
      end;
      do _i = 1 to dim(charvars);
        if not missing(charvars[_i]) then do;
          varname = vname(charvars[_i]);
          charvalue= charvars[_i];
          output;
        end;
      end;
      keep numvalue charvalue varname;
    run;
    
    
    proc sort data=test_t;
      by varname;
    run;
    
    data want;
      set test_t;
      by varname;
      if first.varname;
    run;
    

    Then you proc transpose this to get the desired want (or maybe this works for you as is). It does lose the formats/etc. on the value, so take that into account, and your character value length probably needs to be set to something appropriately long - and then set back (you can use an if 0 then set to fix it).

    A similar hash approach would work roughly the same way; it has the advantage that it would stop much sooner, and doesn't require resorting.

    data test_h;
      set test end=eof;
    
      array numvars _numeric_;
      array charvars _character_;
      length varname $32 numvalue 8 charvalue $1024; *or longest charvalue length;
      if _n_=1 then do;
        declare hash h(ordered:'a');
        h.defineKey('varname');
        h.defineData('varname','numvalue','charvalue');
        h.defineDone();
      end;
    
      do _i = 1 to dim(numvars);
        if not missing(numvars[_i]) then do;
          varname = vname(numvars[_i]);
          rc = h.find();
          if rc ne 0 then do;
            numvalue= numvars[_i];
            rc=h.add();
          end;    
        end;
      end;
      do _i = 1 to dim(charvars);
        if not missing(charvars[_i]) then do;
          varname = vname(charvars[_i]);
          rc = h.find();
          if rc ne 0 then do;
            charvalue= charvars[_i];
            rc=h.add();
          end;    
        end;
      end;
    
      if eof or h.num_items = dim(numvars) + dim(charvars) then do;
         rc = h.output(dataset:'want');
      end;
    run;
    

    There are lots of other solutions, just depending on your data which would be most efficient.