Search code examples
sortingsasno-duplicates

SAS: Proc sort nodupkeys error


I have two data sets :

  • "mothers" - 5,512 observations where the variable "MOTHER" = 1
  • "all_women" - 2,336,750 observations where the variable "MOTHER" = 0

I combine the two as follows:

data combined;
set mothers all_women;

Now as the mothers are already in the datset all_women, I want to delete the repeated entries - with the condition that I keep the observations where "MOTHER"=1.

I tried the following:

proc sort data=combined; by ID DESCENDING MOTHER; run;

proc sort data=combined nodupkeys; by ID; run;

yet I lose some of the mothers because I am left with only 5458 observations where "MOTHER"=1. What have I done to introduce this error?


Solution

  • Instead of using NODUPKEY, use FIRST./LAST. processing.

    proc sort data=combined; 
      by ID DESCENDING MOTHER; 
    run;
    
    data want;
      set combined;
      by ID descending mother;
      if not (first.ID) and (mother=0) then delete;
    run;
    

    That would keep any ID that had mother=0 only (keep 1 record per), and would keep all mother=1 rows.