r matching linkage record-linkage

Deduplicate after probabilistic linkage in R


I just performed probabilistic linkage on two datasets. The output dataset, called "data", contains the identification numbers from the two original datasets, ID_A and ID_B, along with a linkage score, "Match_score".

ID_A <- c('123','124','125','125','126','127','127','128','129')
ID_B <- c('777','778','787','799','762','762','777','999','781')
Match_score <- c(28.1,15.6,19.7,18.9,36.1,55.1,28.7,19.5,18.2)
data <- data.frame(ID_A, ID_B, Match_score)

There are numerous combinations of ID_A and ID_B, so the same ID can appear in several candidate pairs. I want to keep only the highest-scoring pair, then remove both of its IDs from the selection process before choosing further links. An ideal output would be:

ID_A     ID_B     Match_score 
127       762       55.1
123       777       28.1
125       787       19.7
128       999       19.5
129       781       18.2
124       778       15.6

ID_A 126 would not be matched because its only candidate, ID_B 762, has a higher Match_score with another ID_A (127).

ID_B 799 would not be matched because ID_A 125 has a higher Match_score with ID_B 787.

Any help would be greatly appreciated!

I have a solution to my problem in SAS, but I am having difficulty converting it to R.

proc sort data=one;
  by descending match_score ID_A ID_B;
run;

data want;
  if _n_=1 then do;
    dcl hash ha();
    ha.definekey('ID_A');
    ha.definedone();
    dcl hash hb();
    hb.definekey('ID_B');
    hb.definedone();
  end;
  set one;
  if ha.check()*hb.check() then do;
    output;
    ha.add();
    hb.add();
  end;
run;

Solution

  • I tried to follow your logic. The code below is a little messy, but I think this is one solution using base R.

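    # ID_A values that occur more than once
    map_A <- data[duplicated(data$ID_A), ]$ID_A
    
    # For each duplicated ID_A, drop its lowest-scoring row
    for (i in map_A) {
        temp <- data[data$ID_A == i, ]
        index <- row.names(temp[which.min(temp$Match_score), ])
        data <- data[row.names(data) != index, ]
    }
    
    # ID_B values that still occur more than once
    map_B <- data[duplicated(data$ID_B), ]$ID_B
    
    # For each duplicated ID_B, drop its lowest-scoring row
    for (i in map_B) {
        temp <- data[data$ID_B == i, ]
        index <- row.names(temp[which.min(temp$Match_score), ])
        data <- data[row.names(data) != index, ]
    }
    
    # Show the remaining pairs, highest score first
    data[order(-data$Match_score), ]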
    

    which gives:

      ID_A ID_B Match_score
      127  762        55.1
      123  777        28.1
      125  787        19.7
      128  999        19.5
      129  781        18.2
      124  778        15.6
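
For comparison, here is a minimal sketch of a more direct base R translation of the SAS step in the question, as I read it: sort by descending score, then keep a row only if neither its ID_A nor its ID_B has already been kept. The helper name `greedy_dedup` and the `used_A`/`used_B` vectors are my own and stand in for the two SAS hash objects; treat this as an illustration under those assumptions rather than a definitive implementation.

```r
# Sample data from the question
ID_A <- c('123','124','125','125','126','127','127','128','129')
ID_B <- c('777','778','787','799','762','762','777','999','781')
Match_score <- c(28.1,15.6,19.7,18.9,36.1,55.1,28.7,19.5,18.2)
data <- data.frame(ID_A, ID_B, Match_score)

# greedy_dedup is a hypothetical helper name, not part of any package
greedy_dedup <- function(df) {
  # Highest score first; ties broken by ID_A, then ID_B (as in the PROC SORT)
  df <- df[order(-df$Match_score, df$ID_A, df$ID_B), ]

  used_A <- character(0)   # IDs already paired, playing the role of hash ha
  used_B <- character(0)   # IDs already paired, playing the role of hash hb
  keep <- logical(nrow(df))

  for (i in seq_len(nrow(df))) {
    a <- as.character(df$ID_A[i])
    b <- as.character(df$ID_B[i])
    # Keep the row only if neither ID has been used by a higher-scoring pair
    if (!(a %in% used_A) && !(b %in% used_B)) {
      keep[i] <- TRUE
      used_A <- c(used_A, a)
      used_B <- c(used_B, b)
    }
  }
  df[keep, ]
}

result <- greedy_dedup(data)
print(result, row.names = FALSE)
```

On the sample data this keeps the same six pairs as the ideal output in the question. The two approaches are not guaranteed to agree on other inputs, though. For example, with the pairs (A1, B1, 50), (A1, B2, 40) and (A2, B1, 60), the greedy sketch keeps A2-B1 and A1-B2, whereas dropping the lowest-scoring row per duplicated ID first leaves only A2-B1.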