Search code examples
sasproc-sql

Optimize proc sql statements in SAS


I'm very new to SAS and trying to learn it. I have a problem statement where I need to extract two files from a location and then perform joins. Below is a detailed explanation of what I'm trying to achieve in a single proc sql statement:

  1. There are two tables, table a (columns - account#, sales, transaction, store#) and table b (columns - account#, account zipcode) and an excel file (columns - store# and store zipcode). I need to first join these two tables on column account#.
  2. Next step is to join their resulting values with the excel file on column store# and also add a column called as 'distance', which calculates the distance between account zipcode and store zipcode with the help of zipcitydistance(account zipcode, store zipcode) function. Let the resulting table be called "F".
  3. Next I want to use case statement to create a column of distance bucket based on the distance from above query, for e.g., select case when distance<=5 then "<=5" when distance between 5 and 10 then "5-10" when distance between 10 and 15 then "10-15" else ">=15" end as distance_bucket, sum(transactions) as total_txn, sum(sales) as total_sales, from F group by 1

So far, below is the code that I have written:


data table_a
set xyzstore.filea;
run;

data table_b
set xyzstore.fileb;
run;

proc import datafile="/location/file.xlsx"
out=filec dbms=xlsx replace;
run;

proc sql;
create table d as
select a.store_number, b.account_number, sum(a.sales) as sales, sum(a.transactions) as transactions, b.account_zipcode
from table_a left join table_b
on a.account_number=b.account_number
group by a.store_number, b.account_number, b.account_zipcode;
quit;

proc sql;
create table e as
select d.*, c.store_zipcode, zipcitydistance(table_d.account_zipcode, c.store_zipcode) as distance
from d inner join filec as c
on d.store_number=c.store_number;
quit;

proc sql;
create table final as
select 
case 
when distance<=5 then "<=5"
when distance between 5 and 10 then "5-10"
when distance between 10 and 15 then "10-15"
else ">=15"
end as distance_bucket,
sum(transactions) as total_txn,
sum(sales) as total_sales,
from e
group by 1;
quit;

How can I write the above lines of code in a single proc sql statement?


Solution

  • The way you are currently doing it is more readable and the preferred way to do it. Turning it into a single SQL statement will not yield any significant performance gains and will make it harder to troubleshoot in the future.

    To do a little cleanup, you can remove the two data step set statements and join directly on those files themselves:

    create table d as
        ...
        from xyzstore.filea left join xystore.fileb
        ...
    quit;
    

    You could also use a format instead to clean up the CASE statement.

    proc format;
        value storedistance 
            low - 5    = '<=5'
            5< - 10    = '5-10'
            10< - 15   = '10-15'
            15  - high = '>=15'
            other = ' '
        ;
    run;
    
    ...
    
    proc sql;
        create table final as
            select put(distance, storedistance.) as distance_bucket
                 , sum(transactions) as total_txn
                 , sum(sales) as total_sales
            from e
            group by calculated distance_bucket
        ;
    quit;
    

    If you did want to turn your existing code into one big SQL statement, it would look like this:

    proc sql;
        create table final as
            select CASE
                        when(distance <= 5) then '<=5'
                        when(distance between 5 and 10) then '5-10'
                        when(distance between 10 and 15) then '10-15'
                        else '>=15'
                   END as distance_bucket
                 , sum(transactions) as total_txn
                 , sum(sales) as total_sales
    
            /* Join 'table d' with c */
            from (select d.*
                       , c.store_zipocde
                       , zipcitydistance(d.account_zipcode, c.store_zipcode) as distance
    
                  /* Create 'table d' */
                  from (select a.store_number
                             , b.account_number
                             , sum(a.sales) as sales
                             , sum(a.transactions) as transactions
                             , b.account_zipcode
                       from xyzstore.filea as a
                       LEFT JOIN
                            xyzstore.fileb as b
                       ON a.account_number = b.account_number
                       group by a.store_number
                              , b.account_number
                              , b.account_zipcode
                       ) as d
                  INNER JOIN    
                       filec as c
                 )
             group by calculated distance_bucket
         ;
    quit;
    

    While more compact, it is more difficult to troubleshoot. You lose those in-between steps that can identify if there's an issue with the data. Suppose the store distances look incorrect one day: you'd need to unpack all of those SQL statements, put them into individual PROC SQL blocks and run them. Every time you run into a problem you will need to do this. If you have them separated out, you'll use a negligible amount of temporary disk space and have a much easier time troubleshooting.

    When dealing with raw data, especially data that updates regularly, assume something will go wrong one day and you'll need to review it in-depth. Sometimes the wrong file gets sent. Sometimes an upstream issue occurs that sends corrupted data. Any time that happens, you'll need to dig in and find out if it's a problem with your process or their process. Making easy-to-troubleshoot code will speed up the solution for everyone.