I am relatively new to data linkage in general and to the R RecordLinkage package in particular. I have data like the following:
library(RecordLinkage)
library(RCurl)

# Read both data sets and drop the id column so it is not compared
dss_member <- read.csv(text = getURL("https://raw.githubusercontent.com/kilimba/data/master/dss_member.csv"),
                       stringsAsFactors = FALSE)
dss_member$id <- NULL
patient <- read.csv(text = getURL("https://raw.githubusercontent.com/kilimba/data/master/patient.csv"),
                    stringsAsFactors = FALSE)
patient$id <- NULL

# Build comparison patterns for all record pairs, then compute weights
rpairs <- compare.linkage(patient, dss_member)
rpairs$pairs
rpairs <- epiWeights(rpairs)
summary(rpairs)
As you can see, I have two data frames, dss_member (11 rows) and patient (5 rows). I have inserted a row in both that should definitely be a link: the user James Earl Jones. However, I have two concerns.
The line rpairs$pairs results in output where the last column, is_match, always shows as NA, even though I am sure that at least one row is identical in both datasets. What does this mean? This is related to another SO question which is yet to be answered.
The lines
rpairs <- epiWeights(rpairs)
summary(rpairs)
give the following result:
Linkage Data Set
5 records in data set 1
11 records in data set 2
55 record pairs
0 matches
0 non-matches
55 pairs with unknown status
Weight distribution:
  [0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8]   (0.8,1]
       47         1         3         2         2
(a) Why does it show 0 matches and 0 non-matches when there is definitely at least one match (James Earl Jones)?
(b) Is the identity argument in the function compare.linkage() optional? If so, what happens when you leave it out versus putting it in?
(c) Can one use this package even in the absence of a "Gold Standard" to perform record linkage, and not record linkage evaluation?
Kind regards, Tumaini
Tumaini,
You need to distinguish between true status (false or true) and classification (non-link, possible, or link). See the authors' article in R Journal 2/2 (2010), the manual for the package, and the authors' response here: R RecordLinkage Identity.
To answer your questions directly:
(a) The output shows "0 matches" and "0 non-matches" because you omitted the identity1 and identity2 arguments in compare.linkage().
(b) Yes, the identity1 and identity2 arguments in compare.linkage() are optional. If you omit them, the true match status is unknown to the package, so is_match is NA for every pair. If you specify them correctly, the true match status of each pair is recorded in is_match.
(c) I am not sure what you mean by "record linkage" versus "record linkage evaluation". Record linkage can be understood as a classification problem with the comparison pattern as input and the matching status as output.
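To illustrate (c), here is a minimal sketch of linking without any gold standard: compute weights, then classify pairs with a threshold. The toy data frames below are hypothetical stand-ins for patient and dss_member (the column names and values are invented for illustration), and the threshold of 0.7 is an arbitrary assumption that you would need to tune on your own data.

```r
library(RecordLinkage)

# Hypothetical toy data; not the real patient / dss_member files
patient <- data.frame(
  firstname = c("James", "Mary", "Peter"),
  lastname  = c("Jones", "Smith", "Brown"),
  birthyear = c(1931, 1980, 1975),
  stringsAsFactors = FALSE
)
dss_member <- data.frame(
  firstname = c("Anna", "James", "Mary", "John"),
  lastname  = c("White", "Jones", "Smyth", "Black"),
  birthyear = c(1990, 1931, 1980, 1960),
  stringsAsFactors = FALSE
)

# No identity vectors are given, so the true status of every pair is unknown
rpairs <- compare.linkage(patient, dss_member)
rpairs <- epiWeights(rpairs)

# Classify each pair as link or non-link from its weight alone.
# 0.7 is an arbitrary threshold chosen for this sketch.
result <- epiClassify(rpairs, threshold.upper = 0.7)

# Count predicted links (L), possible links (P) and non-links (N)
table(result$prediction)

# Show the record pairs that were classified as links
getPairs(result, show = "links")
```

The classification here says nothing about whether the links are correct; checking that is where known identities (the true status) would come in.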
Here is a four-step solution you may want to try:
1) Run compare.linkage without identity arguments.
2) Create two identity variables from the record pairs.
3) Convert the two identity variables into identity vectors.
4) Run compare.linkage again but with identity arguments.
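The four steps above might look roughly like the sketch below. It uses hypothetical toy data rather than your files, and it assumes that you determine the true identities yourself, for example by reviewing the pairs from step 1. The identifier values in id_patient and id_dss are invented; what matters is that records referring to the same person share a value across the two vectors.

```r
library(RecordLinkage)

# Hypothetical toy data; not the real patient / dss_member files
patient <- data.frame(
  firstname = c("James", "Mary", "Peter"),
  lastname  = c("Jones", "Smith", "Brown"),
  birthyear = c(1931, 1980, 1975),
  stringsAsFactors = FALSE
)
dss_member <- data.frame(
  firstname = c("Anna", "James", "Mary", "John"),
  lastname  = c("White", "Jones", "Smyth", "Black"),
  birthyear = c(1990, 1931, 1980, 1960),
  stringsAsFactors = FALSE
)

# Step 1: compare without identity arguments and inspect the pairs
rpairs <- compare.linkage(patient, dss_member)
rpairs$pairs

# Steps 2 and 3: build one identity vector per data set, with one entry per
# record. Assumed here from manual review: patient 1 is dss_member 2 and
# patient 2 is dss_member 3. All other records get distinct values.
id_patient <- c(1, 2, 3)
id_dss     <- c(10, 1, 2, 11)

# Step 4: compare again, now passing the identity vectors
rpairs <- compare.linkage(patient, dss_member,
                          identity1 = id_patient, identity2 = id_dss)
rpairs$pairs$is_match

rpairs <- epiWeights(rpairs)
summary(rpairs)
```

With the identity vectors supplied, is_match should no longer be NA, and summary() can report matches and non-matches instead of pairs with unknown status.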
Anders Alexandersson andersalex@gmail.com