Lets suppose we have the following table ("Purchases"):
Date Units_Sold Brand Year
18/03/2010 5 A 2010
12/04/2010 2 A 2010
22/05/2010 1 A 2010
25/05/2010 7 A 2010
11/08/2011 5 A 2011
12/07/2010 2 B 2010
22/10/2010 1 B 2010
05/05/2011 7 B 2011
And the same logic continues until the end of 2014, for different brands.
What I want to do is calculate the number of Units_Sold for every Brand, in each year. However, I don't want to do it for the calendar year, but for the actual year.
So an example of what I don't want:
proc sql;
create table Dont_Want as
select Year, Brand, sum(Units_Sold) as Unit_per_Year
from Purchases
group by Year, Brand;
quit;
The above logic is ok if we know that e.g. Brand "A" exists throughout the whole 2010. But if Brand "A" appeared on 18/03/2010 for the first time, and exists until now, then a comparison of Years 2010 and 2011 would not be good enough as for 2010 we are "lacking" 3 months.
So what I want to do is calculate:
for A: the sum from 18/03/2010 until 17/03/2011, then from 18/03/2011 until 17/03/2012, etc.
for B: the sum from 12/07/2010 until 11/07/2011, etc.
and so on for all Brands.
Is there a smart way of doing this?
Step 1: Make sure your dataset is sorted or indexed by Brand and Date
proc sort data=want;
by brand date;
run;
Step 2: Calculate the start/end dates for each product
The idea behind the below code:
We know that the first occurrence of the brand in the sorted dataset is the day in which the brand was introduced. We'll call this Product_Year_Start
.
The intnx
function can be used to increment that date by 365 days, then subtract 1 from it. Let's call this date Product_Year_End
.
Since we now know the product's year end date, we know that if the date on any given row exceeds the product's year end date, we have started the next product year. We'll just take the calculated Product_Year_End
and Product_Year_Start
for that brand and bump them up by one year.
This is all achieved using by-group processing and the retain
statement.
data Comparison_Dates;
set have;
by brand date;
retain Product_Year_Start Product_Year_End;
if(first.brand) then do;
Product_Year_Start = date;
Product_Year_End = intnx('year', date, 1, 'S') - 1;
end;
if(Date > Product_Year_End) then do;
Product_Year_Start = intnx('year', Product_Year_Start, 1, 'S');
Product_Year_End = intnx('year', Product_Year_End, 1, 'S');
end;
format Product_Year_Start Product_Year_End date9.;
run;
Step 3: Using the original SQL code, group instead by the new product start/end dates
proc sql;
create table want as
select catt(year(Product_Year_Start), '-', year(Product_Year_End) ) as Product_Year
, Brand
, sum(Units_Sold) as Unit_per_Year
from Comparison_Dates
group by Brand, calculated Product_Year
order by Brand, calculated Product_Year;
quit;