The data looks as following:
YEAR MONTH ID1 ID2 FIELD FIELD_DESC
2017 4 123 2222 FFF red1
2017 4 123 2222 FFG red2
2017 4 224 2221 XYZ green1
2017 4 225 1234 TYU blue5
and desired output is
YEAR MONTH ID1 ID2 blue5 green1 red1 red2
2017 4 123 2222 0 0 1 1
2017 4 224 2221 0 1 0 0
2017 4 225 1234 1 0 0 0
On SQL server I used to run the following:
select year, month, id1, id2, [field1], [field2] .... [fieldn]
from (select year, month, id1, id2, field, field_desc from source_table) P
Pivot (count(field) for field_desc in ([field1], [field2] .... [fieldn])) Pvt
The above would give me counts of field of each possible description by year, month, id1, id2.
I'm trying to translate this to SAS.
One approach would be conditional sums
proc sql;
create table aggr_table as
select year, month, id1, id2,
sum(case when field_desc = 'field1' then 1 else 0 end) as field1
...
sum(case when field_desc = 'fieldn' then 1 else 0 end) as fiendn
from source_table
group by year, month, id1, id2;
quit;
Second way I tried to do it was by
proc sort data = source_table
by year descending month descending id1 descending id2;
run;
data table_aggr (keep year month id1 id2 field1 .... fieldn);
set source_table
retain field1
....
fieldn 0);
if field_desc = 'field1' then do;
field1 = field1 +1;
end;
....
if field_desc = 'fieldn' then do;
fieldn = fieldn + 1;
end;
if last.id2 then
output;
by year month id1 id2;
run;
But the second way does not seem to work
ERROR: BY variables are not properly sorted on data set WORK.SOURCE_TABLE
My questions: 1) My googling so far seems to indicate preference for doing this kind of data manipulation in the data step rather than proc sql step, is there a particular advantage ?
2) What did I do wrong in the data step ?
3) Is there a better way to replicate the SQL code at the top of the post ? Both SAS options I'm looking at look rather unwieldy compared to SQL original.
Thank you Ben
You can use PROC SUMMARY to count for you and then use PROC TRANSPOSE to turn the counts into variables instead of observations. If you just do it directly then you will get missing values instead of zero for the combinations that do not appear. You could post-process the file to replace the missing counts with zeros. Or use the method below to build a CLASSDATA table to feed to PROC SUMMARY to insure that all of zeros are included before the transpose.
proc sql noprint ;
create table classdata as
select *
from (select distinct year,month,id1,id2 from have) a
, (select distinct field_desc from have) b
;
quit;
proc summary data=have nway classdata=classdata exclusive ;
class year month id1 id2 field_desc ;
output out=counts ;
run;
proc transpose data=counts out=want(drop=_name_);
by year month id1 id2 ;
id field_desc ;
var _freq_;
run;
You could have PROC SQL generate the full rank counts directly with a little bit of tricky SQL. Combine the data with the full list of the id variable values and count how many times the two id variables match.
proc sql noprint ;
create table counts as
select year,month,id1,id2
, b.field_desc
, sum(a.field_desc=b.field_desc) as count
from have a
, (select distinct field_desc from have) b
group by year,month,id1,id2,b.field_desc
order by year,month,id1,id2,b.field_desc
;
quit;
Generating the SQL is easy. If the number of derived variables is small then just generate the code into a macro variable. (If the list is large then generate the code using a data step using either call execute()
or by writing it to a file and using %include
to run it.)
proc sql noprint ;
select distinct
catx(' '
,'sum(field_desc ='
,quote(trim(field_desc))
,') as'
,nliteral(field_desc)
)
into :code separated by ','
from have
;
create table want as
select year, month, id1, id2
, &code
from have
group by year, month, id1, id2
;
quit;
If you want to do it in a data step then look into using HASH object to gather that data. Then use PROC TRANSPOSE or a code generation technique (like in the SQL above) to transform from observations to variables.