Search code examples
saswhere-clauseproc-sql

How is the SAS proc sql create table (where=()) option different to a where clause in the sql query and when should it be used?


From what I can tell both example code blocks will produce the same output, but the in the first statement the where is applied after the query inside is completed whereas in the second statement the where is applied during the running of the statement.

I am correct in thinking the second option will be more performant due to the filtering occurring earlier and reducing the data in the join? When would this "where" option for create table be the preferred option?

proc sql;
create table (where=(field1=1))
select a.field1, a.field2
from table1 a
join table2 b
on a.field1 = b.field1;
quit;
proc sql;
create table
select a.field1, a.field2
from table1 a
join table2 b
on a.field1 = b.field1;
where a.field1 = 1;
quit;

Solution

  • Syntax

    The data set option (where=(...)) can only put conditions on that table.

    select man.name, woman.name
    from man(age > 60), woman(age between 50 and 60)
    where man.address = woman.address;
    

    The SQL where clause applies to all tables involved in the query, hence if two tables have fields with identical names you have to qualify that name.

    select man.name, woman.name
    from man, woman
    where man.address = woman.address
      and man.age > 60
      and woman.age between 50 and 60;
    

    Processing

    Logically, the dataset option is applied to the data table before it is involved in the query and the result of that condition is stored in a temporary table, which could be huge. This would mean the dataset option is slower.

    In practice however, all SQL goes through an "optimizer" before it is applied to the data.

    • If you query sas Base tables, that optimizer is from SAS.
    • If you query a database, that optimizer is from the database.

    So what is more efficient, depends on the library engine.

    A particular case is, when one table is on a database and another one in SAS base, or both tables are in different databases. Then the data of each table is retrieved first, then joined in SAS. It might wel be, that then a condition in the dataset options is already applied on the database and a condition in the where clause is only applied in SAS. This would mean the dataset option is faster;