Search code examples
sassas-macro

Sas Macro to semi-efficiently manipulate data


Objective: Go from Have table + Help table to Want table. The current implementation (below) is slow. I believe this is a good example of how not to use SAS Macros, but I'm curious as to whether... 1. the macro approach could be salvaged / made fast enough to be viable (e.g. proc append is supposed to speed up the action of stacking datasets, but I was unable to see any performance gains.) 2. what all the alternatives would look like.

I have written a non-macro solution that I will post below for comparison sake.

Data: 
data have ; 
input name $ term $; 
cards;
Joe   2000 
Joe   2002
Joe   2008 
Sally 2001
Sally 2003
; run; 

proc print ; run; 

data help ; 
input terms $ ; 
cards; 
2000
2001
2002
2003
2004
2005
2006
2007
2008
; run; 

proc print ; run; 

data want ; 
input name $ term $ status $; 
cards;
Joe   2000  here
Joe   2001  gone
Joe   2002  here
Joe   2003  gone
Joe   2004  gone
Joe   2005  gone
Joe   2006  gone
Joe   2007  gone
Joe   2008  here
Sally 2001  here
Sally 2002  gone
Sally 2003  here
; run; 

proc print data=have ; run; 

I can write a little macro to get me there for each individual:

%MACRO RET(NAME); 
proc sql ; 
create table studtermlist as 
select distinct term 
from have 
where NAME = "&NAME"
; 
SELECT Max(TERM) INTO :MAXTERM 
FROM HAVE
WHERE NAME = "&NAME"
; 
SELECT MIN(TERM) INTO :MINTERM 
FROM HAVE
WHERE NAME = "&NAME"
; 
CREATE TABLE TERMLIST AS 
SELECT TERMS  
FROM HELP 
WHERE TERMS BETWEEN "&MINTERM." and "&MAXTERM."
ORDER BY TERMS 
;
CREATE TABLE HEREGONE_&Name AS 
SELECT 
A.terms , 
"&Name" as Name,
CASE 
WHEN TERMS EQ TERM THEN  'Here'
when term is null THEN 'Gone'
end as status
from termlist a left join studtermlist b 
 on a.terms eq b.term 
; 
quit; 
%MEND RET ; 


%RET(Joe);
%RET(Sally);

proc print data=HEREGONE_Joe; run; 
proc print data=HEREGONE_Sally; run; 

But it's incomplete. If I loop through for (presumably quite a few names)...

*******need procedure for all names - grab info on have ; 
proc sql noprint; 
select distinct name into :namelist separated by ' '
from have
; quit;

%let n=&sqlobs ; 


%MACRO RETYA ; 
OPTIONS NONOTEs ; 
%do i = 1 %to &n ; 
 %let currentvalue = %scan(&namelist,&i); 
 %put &currentvalue ; 
 %put &i ; 
%RET(&currentvalue);
%IF &i = 1 %then %do ; 
data base; set HEREGONE_&currentvalue; run; 
                 %end; 
%IF &i gt 1 %then %do ; 
proc sql ; create table base as 
select * from base
union 
select * from HEREGONE_&currentvalue
;
drop table HEREGONE_&currentvalue;
quit;
                 %end; 
%end ; 
OPTIONS NOTES; 
%MEND; 

%RETYA ; 

proc sort data=base ; by name terms; run; 
proc print data=base; run; 

So now I have want, but with 6,000 names, it takes over 20 minutes.


Solution

  • Let's try the alternative solution. For each name find the min/max term via a proc SQL data step. Then use a data step to create the time period table and merge that with your original table.

    *Sample data;
    data have ; 
    input name $ term ; 
    cards;
    Joe   2000 
    Joe   2002
    Joe   2008 
    Sally 2001
    Sally 2003
    ; run; 
    
    *find min/max of each name;
    proc sql;
    create table terms as
    select name, min(term) as term_min, max(term) as term_max
    from have
    group by name
    order by name;
    quit;
    
    *Create table with the time periods for each name;
    data empty;
    set terms;
    do term=term_min to term_max;
    output;
    end;
    drop term_min term_max;
    run;
    
    *Create final table by merging the original table with table previously generated;
    proc sql;
    create table want as
    select a.name, a.term, case when missing(b.term) then 'Gone'
                            else 'Here' end as status
    from empty a
    left join have b
    on a.name=b.name
    and a.term=b.term
    order by a.name, a.term;
    quit;
    

    EDIT: Now looking at your macro solution, part of the problem is that you're scanning your table too many times.

    • The first table, studenttermlist is not required, the last join can be filtered instead.
    • The two macro variables, min/max term can be calculated in a single pass
    • Avoid the smaller interim term list and use a where clause to filter your results
    • Use Call Execute to call your macro rather than another macro loop
    • Rather than loop through to append the data, take advantage of a naming convention and use a single data step to append all outputs.

      %MACRO RET(NAME); 
      proc sql noprint; 
      
      SELECT MIN(TERM), Max(TERM) INTO :MINTERM,  :MAXTERM
      FROM HAVE
      WHERE NAME = "&NAME"
      ; 
      
      
      CREATE TABLE _HG_&Name AS 
      SELECT 
      A.terms , 
      "&Name" as Name,
      CASE 
      WHEN TERMS EQ TERM THEN  'Here'
      when term is null THEN 'Gone'
      end as status
      from help a 
      left join have b 
       on a.terms eq b.term 
       and b.name="&name"
       where a.terms between "&minterm" and "&maxterm";
      ; 
      quit; 
      %MEND RET ; 
      
      
      *call macro;
      proc sort data=have;
      by name term;
      run;
      
      data _null_;
          set have;
          by name;
          if first.name then do;
          str=catt('%ret(', name, ');');
          call execute(str);
          end;
      run;
      
      
      *append results;
      data all;
          set _hg:;
      run;