I'm attempting to use the R package RecordLinkage to find duplicate entries between two data frames, one with 74,000 entries and one with roughly 350,000. I've generated an object, rpairs, using RLBigDataLinkage, but I can't get the weighting step to complete. The error it's spitting out is:
Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, :
no diskspace
Here is the code:
library(plyr)           # for rename()
library(RecordLinkage)

# Keep only the linkage fields; both data frames need the same column names and order.
Missing <- data.frame(Missing$fulladdr, Missing$zip, Missing$XCOORD, Missing$YCOORD)
Missing <- rename(Missing, c("Missing.fulladdr"="addr", "Missing.zip"="zip", "Missing.XCOORD"="X", "Missing.YCOORD"="Y"))
samlink <- data.frame(sam$fulladdr, sam$zip, sam$COB.SAM.Longitude, sam$COB.SAM.Latitude)
samlink <- rename(samlink, c("sam.fulladdr"="addr", "sam.zip"="zip", "sam.COB.SAM.Longitude"="X", "sam.COB.SAM.Latitude"="Y"))

# Block on zip (column 2); Jaro-Winkler string comparison on addr (column 1).
rpairs <- RLBigDataLinkage(dataset1 = samlink, dataset2 = Missing,
                           blockfld = c(2), strcmp = c(1), strcmpfun = "jarowinkler")
rpairs_em <- emWeights(rpairs)
It turns out the error is the result of ff (which RecordLinkage uses to hold the comparison patterns on disk, as the error message shows) creating a massive backing file in the Temp folder and thus eating up what limited space I had on my hard drive. Blocking on zip alone still leaves an enormous number of candidate pairs; for scale, with no blocking at all, 74,000 × 350,000 records would mean roughly 26 billion pairs. The best way I found to address this is to increase the number of variables to block on: within the code I changed blockfld = c(2) to blockfld = c(2:4).
Of course, this only works if that blocking setup makes sense for the data.
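For reference, here is a minimal sketch of the adjusted call (in the data frames above, columns 2:4 are zip, X, and Y, so candidate pairs must now agree on all three):

# Blocking on zip AND both coordinates sharply reduces the number of
# candidate pairs, so the on-disk comparison table ff builds stays small.
rpairs <- RLBigDataLinkage(dataset1 = samlink, dataset2 = Missing,
                           blockfld = c(2:4), strcmp = c(1),
                           strcmpfun = "jarowinkler")
rpairs_em <- emWeights(rpairs)

If tighter blocking is too strict for your data, an alternative is to redirect ff's backing files to a drive with more free space before creating the pairs; ff reads the fftempdir option for this (the path below is only an example and must already exist):

options(fftempdir = "D:/fftemp")  # example path on a drive with enough room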