I created a variable from an existing data set. In the new data set I need to calculate several summary statistics, each under a condition, for example: the median of the new variable only for observations whose birth country is not Italy; the mean of a different variable only when age > 35; Q1 and Q3 of the same variable for two groups (Female and Male, for example); and so on. Should I use PROC FREQ or PROC MEANS, given that PROC MEANS includes all these statistics? Either way, this is not working for me. How can I perform this procedure on a single variable from the data?
proc means data=dat2;
where "birth_country" NE Italy";
run;
proc means data dat2;
where Mage>=35;
run;
I wouldn't create a separate data step as suggested by Alex A. That can be a bad habit to develop because, with large datasets, the extra pass over the data can be extremely costly in CPU and I/O time.
Rather, I would subset within the Proc Means call, but slightly differently from Alex A.'s suggestion. You probably don't want a 'by' statement, because it requires the data to be sorted by the 'by' variables (another costly step to minimize):
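/* Subset on input; the string comparison is case-sensitive, so match the stored value exactly */
proc means data=dat2(where=(country ne 'ITALY')) median n mean q1 q3 noprint nway;
  var NewVar;
  /* One output row holding the requested statistics; drop=_: removes _TYPE_ and _FREQ_ */
  output out=median1(drop=_:) n=n median=median mean=mean q1=q1 q3=q3;
run;

proc print data=median1;
run;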
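/* Keep only observations with age > 35, then summarize DiffVar separately for each sex */
proc means data=dat2(where=(age>35)) n median mean q1 q3 noprint nway;
  class sex;
  var DiffVar;
  /* With nway, the output has one row per level of sex and no overall row */
  output out=median2(drop=_:) n=n median=median mean=mean q1=q1 q3=q3;
run;

proc print data=median2;
run;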
The 'noprint' option stops SAS from writing the results to the listing output.
The 'nway' option keeps only the output rows with the highest _TYPE_ value, that is, the rows for each level of the class variable sex. Without it (as Alex A. notes), SAS would write three rows for each requested metric: one for each gender and one overall.
The 'drop=_:' data set option strips out any variable whose name begins with an underscore. For Proc Means, this includes the automatic variables _TYPE_ and _FREQ_, as well as any other variable in the output data set that begins with an underscore.
Adding the 'n' statistic to the Proc Means call gives you the number of nonmissing values of the analysis variable for each level of the class variable. The _FREQ_ variable, by contrast, counts all observations in each output row, including those where the analysis variable is missing, and it is removed here by 'drop=_:'.
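To see what 'nway' and 'drop=_:' remove, here is a minimal sketch that leaves both out. The data set demo and the variables sex and score are made up for illustration and are not from your data:

```sas
/* Hypothetical toy data: 'demo', 'sex' and 'score' are illustrative names only */
data demo;
  input sex $ score;
  datalines;
F 10
F 20
F .
M 30
M 40
;
run;

/* No nway and no drop=_: so _TYPE_ and _FREQ_ stay in the output data set */
proc means data=demo noprint;
  class sex;
  var score;
  output out=stats n=n mean=mean;
run;

proc print data=stats;
run;
```

The printed data set should have three rows: an overall row with _TYPE_=0, plus one row for each sex with _TYPE_=1. In the row for F, _FREQ_ should be 3 while n is 2, because one score is missing.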
Alternatively, you can subset inside the Proc Means call with a 'where' statement. I'm not sure, but my impression is that the two forms perform about the same, because in both cases SAS applies the 'where' condition before the observation is loaded into the PDV (program data vector), unlike a subsetting 'if' in a data step:
proc means data=dat2 median n mean q1 q3 noprint nway;
  /* Same subset as the where= data set option; the string comparison is case-sensitive */
  where country ne 'ITALY';
  var NewVar;
  output out=median1(drop=_:) n=n median=median mean=mean q1=q1 q3=q3;
run;