I have two data sets that look like this:
data sales;
format week date9.;
input store $ week date9. sales;
cards;
A 01JAN2014 200
A 08JAN2014 500
A 22JAN2014 200
B 01JAN2014 100
B 08JAN2014 200
B 15JAN2014 200
;
run;
data dates;
format week date9.;
input week date9.;
cards;
01JAN2014
08JAN2014
15JAN2014
22JAN2014
29JAN2014
;
run;
These are small examples of very large tables in a database. I want to join them using a proc sql step, so that all dates are present for each location (with missing values if no sales). I've been creating a cartesian join of the stores and weeks to produce the desired result, but this method has proven to be extremely inefficient when I attempt to query large amounts of data. I know there is a better way to do this, but cannot get it. Below is my attempt...to clarify, the full_sales data set IS the desired output, I just need a more efficient way to do it. Thanks.
proc sql;
create table cartesian as
select distinct sales.store, dates.week
from sales, dates
order by 1,2;
quit;
proc sql;
create table full_sales as
select cartesian.store,
cartesian.week,
sales.sales
from cartesian
left join sales
on cartesian.store=sales.store and
cartesian.week=sales.week
order by 1,2
;
quit;
I would approach this as:
proc sql;
create table full_sales as
select s.store, w.week, ssa.sales
from (select distinct store from sales) s cross join
dates w left join
sales sa
on s.store = sa.store and
w.week = sa.week
order by 1,2
;
quit;
This does not require the auxiliary table cartesian
, which might be faster. An index on sales(store, week)
would definitely speed the query.