I'm a SAS beginner and I'm curious if the following task can be done much more simple as it is currently in my head.
I have the following (simplified) meta data in a table named user_date_money:
User - Date - Money
with various users and dates for every calendar day (for the last 4 years). The data is ordered by User ASC and Date ASC, sample data looks like this:
User | Date | Money
Anna 23.10.2013 5
Anna 24.10.2013 1
Anna 25.10.2013 12
....
Aron 23.10.2013 5
Aron 24.10.2013 12
Aron 25.10.2013 4
....
Zoe 23.10.2013 1
Zoe 24.10.2013 1
Zoe 25.10.2013 0
I now want to calculate a five day moving average for the Money. I started with the pretty popular apprach with the lag() function like this:
data cma;
set user_date_money;
if missing(money) then
do;
OBS = 0;
money = 0.0;
end;
else OBS = 1;
money5 = lag5(money);
OBS5= lag5(obs);
if missing(money5) then money5= 0.0;
if missing(obs5) then obs5= 0;
if _N_ = 1 then
do;
SUM = 0.0;
N = 0;
end;
else;
sum = sum + money-money5;
n = n + obs-obs5;
MEAN = sum / n ;
retain sum n;
run;
as you see, the problem with this method occurs if there if the data step runs into a new user. Aron would get some lagged values from Anna which of course should not happen.
Now my question: I am pretty sure you can handle the user switch by adding some extra fields like laggeduser and by resetting the N, Sum and Mean variables if you notice such a switch but:
Can this be done in an easier way? Perhaps using the BY Clause in any way? Thanks for your ideas and help!
Best regards
I think the easiest way is to use PROC EXPAND:
PROC EXPAND data=user_date_money out=cma;
ID date;
BY user;
CONVERT money=MEAN / transformin=(setmiss 0) transformout=(movave 5);
RUN;
And as mentioned in John's comment, it's important to remember about missing values (and about beginning and ending observations as well). I've added SETMISS option to the code, as you made it clear that you want to 'zerofy' missing values, not ignore them (default MOVAVE behaviour). And if you want to exclude first 4 observations for each user (since they don't have enough pre-history to calculate moving average 5), you can use option 'TRIMLEFT 4' inside TRANSFORMOUT=().