Search code examples
saspivot-tableenterprise-guide

Pivot table in SAS


The data looks as following:

YEAR    MONTH   ID1 ID2 FIELD   FIELD_DESC
2017    4   123 2222    FFF red1
2017    4   123 2222    FFG red2
2017    4   224 2221    XYZ green1
2017    4   225 1234    TYU blue5

and desired output is

YEAR    MONTH   ID1 ID2 blue5   green1  red1    red2
2017    4   123 2222    0   0   1   1
2017    4   224 2221    0   1   0   0
2017    4   225 1234    1   0   0   0

On SQL server I used to run the following:

select year, month, id1, id2, [field1], [field2] .... [fieldn]
from (select year, month, id1, id2, field, field_desc from source_table) P
Pivot (count(field) for field_desc in ([field1], [field2] .... [fieldn])) Pvt

The above would give me counts of field of each possible description by year, month, id1, id2.

I'm trying to translate this to SAS.

One approach would be conditional sums

proc sql;
create table aggr_table as 
select year, month, id1, id2,
sum(case when field_desc = 'field1' then 1 else 0 end) as field1
...
sum(case when field_desc = 'fieldn' then 1 else 0 end) as fiendn
from source_table 
group by year, month, id1, id2;
quit;

Second way I tried to do it was by

proc sort data = source_table
 by year descending month descending id1 descending id2;
run;

data table_aggr (keep year month id1 id2 field1 .... fieldn);
set source_table
retain field1
....
fieldn 0);

if field_desc = 'field1' then do;
field1 = field1 +1;
end;
....
if field_desc = 'fieldn' then do;
fieldn = fieldn + 1;
end;

if last.id2 then 
output;
by year month id1 id2;
run;

But the second way does not seem to work

ERROR: BY variables are not properly sorted on data set WORK.SOURCE_TABLE

My questions: 1) My googling so far seems to indicate preference for doing this kind of data manipulation in the data step rather than proc sql step, is there a particular advantage ?

2) What did I do wrong in the data step ?

3) Is there a better way to replicate the SQL code at the top of the post ? Both SAS options I'm looking at look rather unwieldy compared to SQL original.

Thank you Ben


Solution

  • You can use PROC SUMMARY to count for you and then use PROC TRANSPOSE to turn the counts into variables instead of observations. If you just do it directly then you will get missing values instead of zero for the combinations that do not appear. You could post-process the file to replace the missing counts with zeros. Or use the method below to build a CLASSDATA table to feed to PROC SUMMARY to insure that all of zeros are included before the transpose.

    proc sql noprint ;
     create table classdata as
     select *
     from (select distinct year,month,id1,id2 from have) a
        , (select distinct field_desc from have) b
     ;
    quit;
    proc summary data=have nway classdata=classdata exclusive ;
      class year month id1 id2 field_desc ;
      output out=counts ;
    run;
    proc transpose data=counts out=want(drop=_name_);
      by year month id1 id2 ;
      id field_desc ;
      var _freq_;
    run;
    

    You could have PROC SQL generate the full rank counts directly with a little bit of tricky SQL. Combine the data with the full list of the id variable values and count how many times the two id variables match.

    proc sql noprint ;
     create table counts as
     select year,month,id1,id2
          , b.field_desc
          , sum(a.field_desc=b.field_desc) as count
     from have a
        , (select distinct field_desc from have) b
     group by year,month,id1,id2,b.field_desc
     order by year,month,id1,id2,b.field_desc
     ;
    quit;
    

    Generating the SQL is easy. If the number of derived variables is small then just generate the code into a macro variable. (If the list is large then generate the code using a data step using either call execute() or by writing it to a file and using %include to run it.)

    proc sql noprint ;
     select distinct
     catx(' '
         ,'sum(field_desc ='
         ,quote(trim(field_desc))
         ,') as'
         ,nliteral(field_desc)
         )
       into :code separated by ','
       from have
     ;
     create table want as
       select year, month, id1, id2
            , &code
       from have
       group by year, month, id1, id2
      ;
    quit;
    

    If you want to do it in a data step then look into using HASH object to gather that data. Then use PROC TRANSPOSE or a code generation technique (like in the SQL above) to transform from observations to variables.