How to Find Record Matches Using R's RecordLinkage package?


I am relatively new to data linkage in general and to the R RecordLinkage package in particular. I have data like the following:

library(RecordLinkage)
library(RCurl)

# Download both data sets; drop the id column so it is not used as a comparison field
dss_member <- read.csv(text = getURL("https://raw.githubusercontent.com/kilimba/data/master/dss_member.csv"),
                       stringsAsFactors = FALSE)
dss_member$id <- NULL
patient <- read.csv(text = getURL("https://raw.githubusercontent.com/kilimba/data/master/patient.csv"),
                    stringsAsFactors = FALSE)
patient$id <- NULL

rpairs <- compare.linkage(patient,dss_member)

rpairs$pairs

rpairs <- epiWeights(rpairs) 

summary(rpairs)

As you can see, I have two data frames, dss_member (11 rows) and patient (5 rows). I have inserted a row in both that should, in theory, definitely be a link: the user James Earl Jones. However, I have two concerns.

  1. The line rpairs$pairs produces output in which the last column, is_match, is always NA, even though I am sure that at least one row is identical in both datasets. What does this mean? This is related to another SO question that has not yet been answered.

  2. The lines

    rpairs <- epiWeights(rpairs)

    summary(rpairs)

give the following result:

Linkage Data Set

5 records in data set 1 
11 records in data set 2 
55 record pairs 

0 matches
0 non-matches
55 pairs with unknown status


Weight distribution:

  [0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8]   (0.8,1] 
       47         1         3         2         2 

(a) Why does it show 0 matches and 0 non-matches when there is definitely at least one match (James Earl Jones)?

(b) Is the identity argument in the function compare.linkage() optional? If so, what happens when you leave it out versus putting it in?

(c) Can one use this package to perform record linkage, rather than record linkage evaluation, even in the absence of a "Gold Standard"?

Kind regards, Tumaini


Solution

  • Tumaini,

    You need to distinguish between true status (false or true) and classification (non-link, possible link, or link). See the authors' article in R Journal 2/2 (2010), the manual for the package, and the authors' response in the related question "R RecordLinkage Identity".
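To make that distinction concrete, here is a minimal sketch on made-up toy data (not the files from the question). The identity values 101 to 104 and the thresholds 0.8 and 0.4 are arbitrary assumptions chosen for illustration. The is_match column holds the true status, while the prediction element holds the classification produced by epiClassify().

```r
library(RecordLinkage)

# Toy data, invented for this illustration
patient <- data.frame(
  first = c("James", "Mary"),
  last  = c("Jones", "Smith"),
  stringsAsFactors = FALSE
)
dss_member <- data.frame(
  first = c("Mary", "James", "Peter", "James"),
  last  = c("Smith", "Jones", "Brown", "Brown"),
  stringsAsFactors = FALSE
)

# Assumed known person ids: equal values mean "same person"
id_patient <- c(101, 102)
id_member  <- c(102, 101, 103, 104)

# True status: is_match is filled in because identity vectors are supplied
rpairs <- compare.linkage(patient, dss_member,
                          identity1 = id_patient, identity2 = id_member)

# Classification: weights, then thresholds (values are arbitrary here)
rpairs <- epiWeights(rpairs)
result <- epiClassify(rpairs, threshold.upper = 0.8, threshold.lower = 0.4)

# Cross-tabulate true status against classification (N, P, L)
print(table(true_status = result$pairs$is_match,
            classification = result$prediction))
```

The two dimensions are independent: a pair can be classified as a link or a possible link regardless of whether its true status is known.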

    To answer your questions directly:

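    (a) The output shows "0 matches" and "0 non-matches" because you omitted the identity1 and identity2 arguments in compare.linkage(). Without them the package has no true status for any pair, so is_match is NA for all 55 pairs and they are all reported as having unknown status.

    (b) Yes, the identity1 and identity2 arguments in compare.linkage() are optional. If you omit them, the true match status of every pair is unknown (NA). If you specify them correctly, a pair counts as a true match when its two records have the same identity value, and as a non-match otherwise.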

    (c) I am not sure what you mean by "record linkage" versus "record linkage evaluation". Record linkage can be understood as a classification problem with the comparison pattern as input and the matching status as output.
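On the practical side of (c), the following sketch suggests that classification itself does not need the identity arguments. It uses made-up toy data, and the threshold of 0.8 is an arbitrary assumption that would need tuning on real data. The true status stays NA throughout, yet epiClassify() still assigns every pair to a class.

```r
library(RecordLinkage)

# Toy data, invented for this illustration
patient <- data.frame(
  first = c("James", "Mary"),
  last  = c("Jones", "Smith"),
  stringsAsFactors = FALSE
)
dss_member <- data.frame(
  first = c("Mary", "James", "Peter"),
  last  = c("Smith", "Jones", "Brown"),
  stringsAsFactors = FALSE
)

# No identity arguments: true status is unknown for every pair
rpairs <- compare.linkage(patient, dss_member)
print(all(is.na(rpairs$pairs$is_match)))

# Weights and classification work without a gold standard
rpairs <- epiWeights(rpairs)
result <- epiClassify(rpairs, threshold.upper = 0.8)  # arbitrary threshold

# Pull out the pairs classified as links and look up the records
links <- result$pairs[result$prediction == "L", c("id1", "id2")]
linked <- data.frame(
  patient_row  = links$id1,
  member_row   = links$id2,
  patient_name = paste(patient$first[links$id1], patient$last[links$id1]),
  member_name  = paste(dss_member$first[links$id2], dss_member$last[links$id2])
)
print(linked)
```

What is lost without identity values is the ability to count errors, since there is nothing to compare the predicted links against.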

    Here is a four-step solution you may want to try:

    1) Run compare.linkage without identity arguments.

    2) Create two identity variables from the record pairs.

    3) Convert the two identity variables into identity vectors.

    4) Run compare.linkage again but with identity arguments.
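One possible reading of these four steps is sketched below on made-up toy data. The rule used in step 2 (treat a pair as a confirmed match when every field agrees) is an assumption standing in for whatever manual review or rule you trust, and the variable names are hypothetical.

```r
library(RecordLinkage)

# Toy data, invented for this illustration
patient <- data.frame(
  first = c("James", "Mary"),
  last  = c("Jones", "Smith"),
  stringsAsFactors = FALSE
)
dss_member <- data.frame(
  first = c("Mary", "James", "Peter"),
  last  = c("Smith", "Jones", "Brown"),
  stringsAsFactors = FALSE
)

# Step 1: compare without identity arguments
rpairs <- compare.linkage(patient, dss_member)

# Step 2: pick the pairs you regard as true matches
# (assumed rule: all comparison fields agree)
fields <- setdiff(names(rpairs$pairs), c("id1", "id2", "is_match"))
agree_all <- rowSums(rpairs$pairs[, fields] == 1, na.rm = TRUE) == length(fields)
confirmed <- rpairs$pairs[agree_all, c("id1", "id2")]

# Step 3: build identity vectors, one value per record;
# start with distinct ids, then copy the patient id onto each matched member
identity1 <- seq_len(nrow(patient))
identity2 <- nrow(patient) + seq_len(nrow(dss_member))
identity2[confirmed$id2] <- identity1[confirmed$id1]

# Step 4: compare again, this time with known true status
rpairs_known <- compare.linkage(patient, dss_member,
                                identity1 = identity1, identity2 = identity2)
summary(rpairs_known)
```

With identity vectors in place, the summary should report matches and non-matches instead of pairs with unknown status.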

    Anders Alexandersson andersalex@gmail.com