Search code examples
sasaggregateaggregate-functionsperiod

Aggregating Over Actual Year in SAS


Lets suppose we have the following table ("Purchases"):

Date                 Units_Sold             Brand       Year
18/03/2010                5                   A         2010
12/04/2010                2                   A         2010
22/05/2010                1                   A         2010
25/05/2010                7                   A         2010
11/08/2011                5                   A         2011
12/07/2010                2                   B         2010
22/10/2010                1                   B         2010
05/05/2011                7                   B         2011

And the same logic continues until the end of 2014, for different brands.

What I want to do is calculate the number of Units_Sold for every Brand, in each year. However, I don't want to do it for the calendar year, but for the actual year.

So an example of what I don't want:

proc sql;
create table Dont_Want as
select Year, Brand, sum(Units_Sold) as Unit_per_Year
from Purchases
group by Year, Brand;
quit;

The above logic is ok if we know that e.g. Brand "A" exists throughout the whole 2010. But if Brand "A" appeared on 18/03/2010 for the first time, and exists until now, then a comparison of Years 2010 and 2011 would not be good enough as for 2010 we are "lacking" 3 months.

So what I want to do is calculate:

for A: the sum from 18/03/2010 until 17/03/2011, then from 18/03/2011 until 17/03/2012, etc.

for B: the sum from 12/07/2010 until 11/07/2011, etc.

and so on for all Brands.

Is there a smart way of doing this?


Solution

  • Step 1: Make sure your dataset is sorted or indexed by Brand and Date

    proc sort data=want;
         by brand date;
    run;
    

    Step 2: Calculate the start/end dates for each product

    The idea behind the below code:

    1. We know that the first occurrence of the brand in the sorted dataset is the day in which the brand was introduced. We'll call this Product_Year_Start.

    2. The intnx function can be used to increment that date by 365 days, then subtract 1 from it. Let's call this date Product_Year_End.

    3. Since we now know the product's year end date, we know that if the date on any given row exceeds the product's year end date, we have started the next product year. We'll just take the calculated Product_Year_End and Product_Year_Start for that brand and bump them up by one year.

    This is all achieved using by-group processing and the retain statement.

    data Comparison_Dates;
        set have;
        by brand date;
    
        retain Product_Year_Start Product_Year_End;
    
        if(first.brand) then do;
            Product_Year_Start = date;
            Product_Year_End = intnx('year', date, 1, 'S') - 1;
        end;
    
        if(Date > Product_Year_End) then do;
            Product_Year_Start = intnx('year', Product_Year_Start, 1, 'S');
            Product_Year_End = intnx('year', Product_Year_End, 1, 'S');
        end;
    
        format Product_Year_Start Product_Year_End date9.;
    run;
    

    Step 3: Using the original SQL code, group instead by the new product start/end dates

    proc sql;
        create table want as
        select catt(year(Product_Year_Start), '-', year(Product_Year_End) ) as Product_Year
             , Brand
             , sum(Units_Sold) as Unit_per_Year
        from Comparison_Dates
        group by Brand, calculated Product_Year
        order by Brand, calculated Product_Year;
    quit;