Search code examples
sqlsasdatastep

sas - calculate moving average for grouped data with BY statement


I'm a SAS beginner and I'm curious if the following task can be done much more simple as it is currently in my head.

I have the following (simplified) meta data in a table named user_date_money:

User - Date - Money

with various users and dates for every calendar day (for the last 4 years). The data is ordered by User ASC and Date ASC, sample data looks like this:

User  | Date     | Money
Anna   23.10.2013   5
Anna   24.10.2013   1
Anna   25.10.2013   12
      ....       
Aron   23.10.2013   5
Aron   24.10.2013   12
Aron   25.10.2013   4 
     ....
Zoe    23.10.2013   1
Zoe    24.10.2013   1
Zoe    25.10.2013   0

I now want to calculate a five day moving average for the Money. I started with the pretty popular apprach with the lag() function like this:

data cma; 
set user_date_money;
if missing(money) then
do;
OBS = 0;
money = 0.0;
end;
else OBS = 1;
money5 = lag5(money);
OBS5= lag5(obs);
if missing(money5) then money5= 0.0;
if missing(obs5) then obs5= 0;

if _N_ = 1 then
do;
SUM = 0.0;
N = 0;
end;
else;
sum = sum + money-money5;
n = n + obs-obs5;
MEAN = sum / n ;
retain sum n;
run;

as you see, the problem with this method occurs if there if the data step runs into a new user. Aron would get some lagged values from Anna which of course should not happen.

Now my question: I am pretty sure you can handle the user switch by adding some extra fields like laggeduser and by resetting the N, Sum and Mean variables if you notice such a switch but:

Can this be done in an easier way? Perhaps using the BY Clause in any way? Thanks for your ideas and help!

Best regards


Solution

  • I think the easiest way is to use PROC EXPAND:

    PROC EXPAND data=user_date_money out=cma;
      ID date;
      BY user;
      CONVERT money=MEAN / transformin=(setmiss 0) transformout=(movave 5);
    RUN;
    

    And as mentioned in John's comment, it's important to remember about missing values (and about beginning and ending observations as well). I've added SETMISS option to the code, as you made it clear that you want to 'zerofy' missing values, not ignore them (default MOVAVE behaviour). And if you want to exclude first 4 observations for each user (since they don't have enough pre-history to calculate moving average 5), you can use option 'TRIMLEFT 4' inside TRANSFORMOUT=().