Search code examples
sasdatastep

SAS - Row by row Comparison within different ID Variables of Same Dataset and delete ALL Duplicates


I need some help in trying to execute a comparison of rows within different ID variable groups, all in a single dataset.

That is, if there is any duplicate observation within two or more ID groups, then I'd like to delete the observation entirely.

I want to identify any duplicates between rows of different groups and delete the observation entirely.

For example:

ID  Value
 1    A
 1    B
 1    C
 1    D
 1    D
 2    A
 2    C
 3    A
 3    Z
 3    B

The output I desire is:

ID  Value
 1    D
 3    Z

I have looked online extensively, and tried a few things. I thought I could mark the duplicates with a flag and then delete based off that flag.

The flagging code is:

data have;
set want;
flag = first.ID ne last.ID;
run;

This worked for some cases, but I also got duplicates within the same value group flagged.

Therefore the first observation got deleted:

ID  Value
 3    Z

I also tried:

data have;
set want;
flag = first.ID ne last.ID and first.value ne last.value;
run;

but that didn't mark any duplicates at all.

I would appreciate any help. Please let me know if any other information is required.

Thanks.


Solution

  • Here's a fairly simple way to do it: sort and deduplicate by value + ID, then keep only rows with values that occur only for a single ID.

    data have;
    input ID  Value $;
    cards;
     1    A
     1    B
     1    C
     1    D
     1    D
     2    A
     2    C
     3    A
     3    Z
     3    B
     ;
    run;
    
    proc sort data = have nodupkey;
        by value ID;
    run;
    
    data want;
    set have;
    by value;
    if first.value and last.value;
    run;
    

    proc sql version:

    proc sql;
    create table want as
    select distinct ID, value from have
    group by value
    having count(distinct id) =1
    order by id
    ;
    quit;