From what I can tell both example code blocks will produce the same output, but the in the first statement the where is applied after the query inside is completed whereas in the second statement the where is applied during the running of the statement.
I am correct in thinking the second option will be more performant due to the filtering occurring earlier and reducing the data in the join? When would this "where" option for create table be the preferred option?
proc sql;
create table (where=(field1=1))
select a.field1, a.field2
from table1 a
join table2 b
on a.field1 = b.field1;
quit;
proc sql;
create table
select a.field1, a.field2
from table1 a
join table2 b
on a.field1 = b.field1;
where a.field1 = 1;
quit;
The data set option (where=(...))
can only put conditions on that table.
select man.name, woman.name
from man(age > 60), woman(age between 50 and 60)
where man.address = woman.address;
The SQL where
clause applies to all tables involved in the query, hence if two tables have fields with identical names you have to qualify that name.
select man.name, woman.name
from man, woman
where man.address = woman.address
and man.age > 60
and woman.age between 50 and 60;
Logically, the dataset option is applied to the data table before it is involved in the query and the result of that condition is stored in a temporary table, which could be huge. This would mean the dataset option is slower.
In practice however, all SQL goes through an "optimizer" before it is applied to the data.
So what is more efficient, depends on the library engine.
A particular case is, when one table is on a database and another one in SAS base, or both tables are in different databases. Then the data of each table is retrieved first, then joined in SAS. It might wel be, that then a condition in the dataset options is already applied on the database and a condition in the where clause is only applied in SAS. This would mean the dataset option is faster;