Okay this seems like a very simple thing, but I can't explain what a "by statement" in a sas datastep is really doing. I know when I need to use it, but I am not sure what it is doing.
In the example below I understand what the virtual sas column of first.var and last.var is when it has the values that it does. Is the by statement creating these virtual columns around the var initial and metal? Then sas is scanning the entire data set once?
data jewelers ;
input id initial $ metal $ ;
datalines;
456 D Gold
456 D Silver
123 L Gold
123 L Copper
123 L PLatinum
567 R Gold
567 R Gold
567 R Gold
345 A Platinum
345 A Silver
345 A Silver
;
proc sort ;
by initial metal ;
run;
data dups;
set jewelers ;
by initial metal ;
if not (first.metal and last.metal) then output;
run;
if I proc print dups I expect this:
567 R Gold
567 R Gold
567 R Gold
345 A Silver
345 A Silver
As you've acknowledged, SAS is creating the automatic variables first.byvar
and last.byvar
for each byvar
in your by
statement. Records read in from a set
statement are held in a buffer before SAS moves them to the PDV (Program Data Vector - where data step logic is executed on each row before it is output), so SAS can look ahead to the next record in the buffer to see whether any of the byvars have changed, and set last.byvar = 1
for the row currently in the PDV.
The only difference I can see between what you say you expect and what you get in the dups dataset is the order of the records - because you've sorted by initial and then metal, the A Silver
records are sorted ahead of the R Gold
records.
If you want to get duplicates across these two variables while preserving the original row order, you'd need to make a note of what the original record order was and sort your duplicates dataset back into that order after your second data step.